Name: analysis-py-utils
Owner: Verily Life Sciences
Description: Python utilities for BigQuery analyses.
Created: 2017-10-20 19:11:40.0
Updated: 2018-01-18 18:19:56.0
Pushed: 2017-11-09 19:24:56.0
Homepage: null
Size: 50
Language: Python
GitHub Committers
User | Most Recent Commit | # Commits |
---|
Other Committers
User | Most Recent Commit | # Commits |
---|
Ensure that the tests are run in an environment authenticated to send
requests to BigQuery, such as Google Cloud Shell or a local machine with
gcloud installed and gcloud auth
application-default login
executed to configure the credential.
Install this Python package.
install git+https://github.com/verilylifesciences/analysis-py-utils.git
Set the environment variable for the Cloud Platform project in which the queries will run.
rt TEST_PROJECT=YOUR-PROJECT-ID
BQTestCase provides an implementation of the Python unittest unit testing framework for testing BigQuery queries. Tests that inherit from it will:
experimental It also provides a mock BigQuery framework for simple unit tests that cuts running time from tens of seconds to less than a second.
Here is a simple example test using BQTestCase:
rt unittest
google.cloud.bigquery.schema import SchemaField
verily.bigquery_wrapper import bq_test_case
s SampleTest(bq_test_case.BQTestCase):
# You always have to override this method in your tests that inherit from bq_test_case.
# The constructor of BQTestCase calls this method and will raise a NotImplementedError
# if it doesn't have a method body. You can create tables other places in your test, too,
# but you probably want to create tables here that are across multiple tests.
@classmethod
def create_mock_tables(cls):
# type: () -> None
"""Create mock tables"""
cls.src_table_name = cls.client.path('tmp')
cls.client.populate_table(
cls.src_table_name,
[SchemaField('foo', 'INTEGER'),
SchemaField('bar', 'INTEGER'),
SchemaField('baz', 'INTEGER')],
[[1, 2, 3], [4, 5, 6]])
@classmethod
def setUpClass(cls):
# type: () -> None
"""Set up class"""
# By default, the BQTestCase will use BigQuery to back its tests. Setting
# use_mocks to True will use mock_bq to back its tests. This makes the tests
# run much faster than they would if they used actual BigQuery. Of course,
# there are some limitations as to what the mock can handle.
super(SampleTest, cls).setUpClass(use_mocks=False)
# In your tests, use cls.client as your BigQuery client, and BQTestCase will
# figure out the rest!
def test_load_data(cls):
# type: () -> None
"""Test bq.Client.get_query_results"""
result = cls.client.get_query_results(
'SELECT * FROM `' + cls.src_table_name + '`')
cls.assertSetEqual(set(result), set([(1, 2, 3), (4, 5, 6)]))
_name__ == "__main__":
ittest.main()
This framework can also be used to test queries in external files.
As a concrete example, suppose you have a query in file data_counts.sql
that
uses Jinja for templating.
ndardSQL
Count the number of distinct foos and bars in the table.
CT
UNT(DISTINCT foo) AS foo_cnt,
UNT(DISTINCT bar) AS bar_cnt
{ DATA_TABLE }}`
To test it, you can create file data_counts_test.py
in the same directory with
the following contents.
rt unittest
google.cloud.bigquery.schema import SchemaField
jinja2 import Template
verily.bigquery_wrapper import bq_test_case
s QueryTest(bq_test_case.BQTestCase):
lassmethod
f setUpClass(cls):
super(QueryTest, cls).setUpClass(use_mocks=False)
lassmethod
f create_mock_tables(cls):
cls.src_table_name = cls.client.path("data")
cls.client.populate_table(
cls.src_table_name,
[SchemaField("foo", "STRING"),
SchemaField("bar", "INTEGER")],
[
["foo1", 0],
["foo2", 0],
["foo3", 10],
["foo1", 10]
]
)
f test_data_counts(self):
# Perform the Jinja template replacement.
sql = Template(
open("data_counts.sql", "r").read()).render(
{"DATA_TABLE": self.src_table_name})
result = self.client.get_query_results(sql)
self.assertSetEqual(set(result), set([(3, 2),]))
_name__ == "__main__":
ittest.main()
By default, BQTestCases will be run against BigQuery or you can explicitly pass
use_mocks=False
to your superconstructor call.
To run BQTestCases against a mock BigQuery (implemented in mock_bq.py) pass
use_mocks=True
to your superconstructor call. This makes tests run in less
than a second instead of the tens of seconds required for accessing actual
BigQuery tables.
The mock is backed by an in-memory SQLite database, so there are some caveats.
EXTRACT(YEAR from x)
function to extract the year from a
timestampFORMAT
function with the equivalent SQLite3 printf
functionCONCAT
operator with SQLite3's ||
operator for
concatenationAs you can imagine there is plenty of room for growth here. These are just the functions that have come up in testing so far.
There will be some BigQuery functions that just won't work in SQLite3. In that
case pass use_mocks=False
in your call to your class's superconstructor so
that the query is sent to BigQuery instead of SQLite3.
Whether you use the mock or not, it's a good idea to make sure your tests pass using actual BigQuery.
If you're having problems with your tests running against real BigQuery, it can be kind of tricky to debug because all your tables get deleted after the test is run. To preserve your tables, override the default implementations of the tearDown methods in you test.
f tearDown(self):
"""Disable clean up of tables after a test case is run."""
pass
lassmethod
f tearDownClass(cls):
"""Disable clean up of the dataset after a test suite is run."""
pass
The test tables have an expiration time of one hour so they will be deleted automatically after that time elaspses but the empty datasets will remain, so be sure to delete those manually.
Probably the most common problem you'll run into is trying to do things in your query that SQLite can't handle. If you try to do that, you'll get a RuntimeException that says:
tried to reformat your query into SQLite, but it still won't work.
k to make sure it was a valid query to begin with, then consider adding the transformation
ed to make it work in the _reformat_query method.
te error: (whatever error SQLite threw)
result query: (your query after we tried to transform it)
From there you can work through whatever solution you choose. There is a flag
passed into the _reformat_query
function called print_before_and_after
that's False by default, but if you set it to True, it will automatically print
out your query as it's passed in to the function, then again after we tried to
reformat it, so that might be helpful for debugging.