UDST/orca_test

Name: orca_test

Owner: Urban Data Science Toolkit

Description: Assertions for testing the characteristics of orca tables and columns

Created: 2016-07-19 19:19:25.0

Updated: 2016-10-28 18:42:27.0

Pushed: 2018-02-20 18:16:49.0

Homepage: null

Size: 44

Language: Python

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

Orca_test

This is a library of assertions about the characteristics of tables, columns, and injectables that are registered in Orca.

The motivation is that UrbanSim model code expects particular tables and columns to be in place, and can fail unpredictably when data is not as expected (missing columns, NaNs, negative prices, log-of-zero). These failures are rare, but hard to debug, and can happen at any time because data is modified as models run.

Orca_test assertions can be included in model steps or used as part of the data preparation pipeline. The goal for this library is for it to be useful (1) as a model development aid, (2) for exception handling as simulations run, and (3) for documenting the data specs required by different UrbanSim templates.

Installation

Clone this repo and run python setup.py develop. Won't be of much use without Orca and some project that's using it for simulation orchestration.

Usage

You can either make assertions directly by calling individual orca_test functions, or assert a full set of characteristics at once. These characteristics are expressed as nested python classes (similar to sqlalchemy), and in the future will have an equivalent YAML syntax.

If an assertion passes, nothing happens. If it fails, an OrcaAssertionError is raised with a detailed message. Orca_test is written to be as computationally efficient as possible, and the main cost will be the generation of tables or columns that have not yet been cached.

Assertions are chained as necessary: for example, asserting a column's minimum value will automatically assert that it is numeric, that missing values are coded in a particular way (np.nan by default), that the column can be generated without errors, and that it is registered with orca.

Example
rt orca_test as ot
 orca_test import OrcaSpec, TableSpec, ColumnSpec

fine a specification
ec = OrcaSpec('my_spec',

TableSpec('buildings', 
    ColumnSpec('building_id', primary_key=True),
    ColumnSpec('residential_price', min=0, missing=False)),

TableSpec('households',
    ColumnSpec('building_id', foreign_key='buildings.building_id', missing_val_coding=-1)),

TableSpec('residential_units', registered=False),

InjectableSpec('rate', greater_than=0, less_than=1))

sert the specification
ssert_orca_spec(o_spec)
Working demos
API Reference

There's fairly detailed documentation of individual functions in the source code.

Classes
Asserting sets of characteristics
Table assertions

| Argument in TableSpec() | Equivalent low-level function | | —————— | ——————————– | | registered = True | assert_table_is_registered( table_name ) | | registered = False | assert_table_not_registered( table_name ) | | can_be_generated = True | assert_table_can_be_generated( table_name ) |

Column assertions

| Argument in ColumnSpec() | Equivalent low-level function | | —————— | ——————————— | | registered = True | assert_column_is_registered( table_name, column_name ) | | registered = False | assert_column_not_registered( table_name, column_name ) | | can_be_generated = True | assert_column_can_be_generated( table_name, column_name ) | | numeric = True | assert_column_is_numeric( table_name, column_name ) | | missing_val_coding = np.nan, 0, -1 | assert_column_missing_value_coding( table_name, column_name, missing_val_coding ) | | missing = False| assert_column_no_missing_values( table_name, column_name, optional missing_val_coding ) | | max_portion_missing = portion | assert_column_max_portion_missing( table_name, column_name, portion, optional missing_val_coding ) | | primary_key = True | assert_column_is_primary_key( table_name, column_name ) | | foreign_key = 'parent_table_name.parent_column_name' | assert_column_is_foreign_key( table_name, column_name, parent_table_name, parent_column_name, optional missing_val_coding ) | | max = value | assert_column_max( table_name, column_name, maximum, optional missing_val_coding) | | min = value | assert_column_min( table_name, column_name, minimum, optional missing_val_coding ) | | is_unique = True | assert_column_is_unique( table_name, column_name ) |

Notes

Providing a missing_val_coding in a ColumnSpec() indicates that there should be no np.nan values in the column. Assertions involving a min, max, or max_portion_missing will take into account the missing_val_coding that's been provided.

For example, asserting that a column with values [2, 3, 3, -1] has min = 0 will fail, but asserting that it has
min = 0, missing_val_coding = -1 will pass.

Injectable assertions

| Argument in InjectableSpec() | Equivalent low-level function | | —————————- | —————————– | | registered = True | assert_injectable_is_registered( injectable_name ) | | registered = False | assert_injectable_not_registered( injectable_name ) | | can_be_generated = True | assert_injectable_can_be_generated( injectable_name ) | | numeric = True | assert_injectable_is_numeric( injectable_name ) | | greater_than = value | assert_injectable_greater_than( injectable_name, value ) | | less_than = value | assert_injectable_less_than( injectable_name, value ) | | has_key = str | assert_injectable_has_key( injectable_name, str ) |

Development wish list
Sample YAML syntax (not yet implemented)
ca_spec:
name: my_spec

table_spec:
- name: buildings
- column_spec:
  - name: building_id
  - primary_key: True
- column_spec:
  - name: residential_price
  - min: 0
  - missing: False

table_spec:
- name: households
- column_spec:
  - name: building_id
  - foreign_key: buildings.building_id
  - missing_val_coding: -1

table_spec:
- name: residential_units
- registered: False

injectable_spec:
- name: rate
- greater_than: 0
- less_than: 1

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.