datamade/data-making-guidelines

Name: data-making-guidelines

Owner: datamade

Description: :blue_book: Making Data, the DataMade Way

Created: 2015-04-13 20:46:34.0

Updated: 2018-05-21 03:36:34.0

Pushed: 2018-04-17 17:03:01.0

Homepage:

Size: 60

Language: HTML

README

Making Data, the DataMade Way

This is DataMade's guide to extracting, transforming, and loading (ETL) data using Make, a common command-line utility.

ETL refers to the general process of:

  1. taking raw source data (“Extract”)
  2. cleaning and reshaping the data, possibly producing intermediate derived files along the way (“Transform”)
  3. producing final output in a more usable form (for “Loading” into something that consumes the data - be it an app, a system, a visualization, etc.)
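
As a rough illustration, the three stages above can map onto three Make targets. This is a minimal sketch under stated assumptions, not DataMade's actual pipeline: the URL, the file names, and the `raw/`, `intermediate/`, and `finished/` directories are hypothetical placeholders, and it assumes `curl`, `sed`, and `sort` are installed.

```makefile
# Hypothetical example: the URL, file names, and directories are placeholders.

# Load: final output in a more usable form (header kept, rows sorted and de-duplicated)
finished/permits.csv: intermediate/permits_clean.csv
	mkdir -p finished
	(head -n 1 $< && tail -n +2 $< | sort -u) > $@

# Transform: derive an intermediate file (blank lines dropped) without modifying the raw source
intermediate/permits_clean.csv: raw/permits.csv
	mkdir -p intermediate
	sed '/^$$/d' $< > $@

# Extract: fetch the raw source data
raw/permits.csv:
	mkdir -p raw
	curl -L -o $@ https://example.com/permits.csv
```

Running `make` with no arguments builds the first target, `finished/permits.csv`, and Make works backwards through the prerequisites to run only the steps whose outputs are missing or out of date.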

Having a standard ETL workflow helps us make sure that our work is clean, consistent, and easy to reproduce. By following these guidelines you'll be able to keep your work up to date and share it with the world in a standard format - all with as few headaches as possible.

Basic Principles

These five principles inform all of our data work:

  1. Never destroy data - treat source data as immutable, and show your work when you modify it
  2. Be able to deterministically produce the final data with one command
  3. Write as little custom code as possible
  4. Use standard tools whenever possible
  5. Keep source data under version control
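
One way these principles might look in a Makefile is sketched below. The file and directory names are hypothetical, and the sketch assumes the `raw/` directory is committed to version control.

```makefile
# Hypothetical sketch: file and directory names are placeholders.
.PHONY: all clean

# Principle 2: `make all` produces the final data with one command
all: finished/names_sorted.txt

# Principle 1: clean removes only derived files; raw/ is never touched
# Principle 5: raw/ is assumed to be checked into version control
clean:
	rm -rf finished

# Principles 3 and 4: a standard tool (sort) instead of custom code
finished/names_sorted.txt: raw/names.txt
	mkdir -p finished
	sort $< > $@
```

Because `clean` only deletes derived output, running `make clean all` rebuilds everything from the untouched source files.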

Unsure how to follow these principles? Read on!

The Guide

  1. Make & Makefile Overview
  2. Why Use Make/Makefiles?
  3. Makefile 101
  4. Makefile 201 - Some Fancy Things Built Into Make
  5. ETL Styleguide
  6. Makefile Best Practices
  7. Variables
  8. Processors
  9. Standard Toolkit
  10. ETL Workflow Directory Structure

Code examples

Further reading

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.