Name: observations
Owner: Edward
Description: Tools for loading standard data sets in machine learning
Created: 2017-06-11 02:31:09.0
Updated: 2018-05-24 07:22:22.0
Pushed: 2018-02-10 20:52:17.0
Size: 1700
Language: Python
Observations provides a one-line Python API for loading standard data sets in machine learning. It automates the process of downloading, extracting, loading, and preprocessing data. Observations helps keep the workflow reproducible and follows sensible standards.
It can be used in two ways.

Install it.

pip install observations

Import it.

from observations import svhn

(x_train, y_train), (x_test, y_test) = svhn("~/data")
All functions take as input a filepath and optional preprocessing arguments. They return a tuple in the form of training data, test data, and validation data (if available). Each element in the tuple is typically a NumPy array, a tuple of NumPy arrays (e.g., features and labels), or a string (text). See the API for details.
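To illustrate this convention, here is a hedged sketch of a loader (the `toy` function below is hypothetical, not part of the library): a data set with train and test splits returns a two-element tuple, where each element is itself a (features, labels) pair of NumPy arrays.

```python
import numpy as np

def toy(path):
  """Hypothetical loader following the observations return convention:
  a tuple of (train, test), each a (features, labels) pair of arrays.
  A real loader would download and extract files under `path`; here
  we just synthesize data for illustration."""
  x_train = np.zeros((100, 32, 32, 3), dtype=np.uint8)
  y_train = np.zeros(100, dtype=np.int64)
  x_test = np.zeros((20, 32, 32, 3), dtype=np.uint8)
  y_test = np.zeros(20, dtype=np.int64)
  return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = toy("~/data")
```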
Copy and paste functions inside the codebase relevant for your experiments.

def enwik8(path):
  ...

x_train, x_test, x_valid = enwik8("~/data")
Each function has minimal dependencies. For example, enwik8.py only depends on core libraries and the external function maybe_download_and_extract in util.py. The functions are designed to be easy to read and hack at.
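The "maybe download" idea can be sketched as follows. This is a simplified, hypothetical version written for illustration (the real helper in util.py also handles extracting archives); the `fetch` callback stands in for whatever actually performs the download.

```python
import os

def maybe_download(path, filename, fetch):
  """Download `filename` into `path` via `fetch(target)` only if the
  file doesn't already exist, so repeated loads hit the local cache.
  (Hypothetical sketch; not the library's actual helper.)"""
  os.makedirs(path, exist_ok=True)
  target = os.path.join(path, filename)
  if not os.path.exists(target):
    fetch(target)  # e.g., urllib.request.urlretrieve(url, target)
  return target
```

With this pattern, only the first call for a given file touches the network; later calls return the cached path immediately.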
Generating minibatches depends on your use case. The data loading functions return the full data set; it's up to you to generate batches according to your needs.
One helpful utility is

def generator(array, batch_size):
  """Generate batch with respect to array's first axis."""
  start = 0  # pointer to where we are in iteration
  while True:
    stop = start + batch_size
    diff = stop - array.shape[0]
    if diff <= 0:
      batch = array[start:stop]
      start += batch_size
    else:
      batch = np.concatenate((array[start:], array[:diff]))
      start = diff
    yield batch
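To see the wrap-around behavior concretely, the generator can be exercised on a toy array (the function is repeated here, with numpy imported, so the snippet runs on its own):

```python
import numpy as np

def generator(array, batch_size):
  """Generate batches with respect to the array's first axis,
  wrapping around to the start when the end is reached."""
  start = 0  # pointer to where we are in iteration
  while True:
    stop = start + batch_size
    diff = stop - array.shape[0]
    if diff <= 0:
      batch = array[start:stop]
      start += batch_size
    else:
      batch = np.concatenate((array[start:], array[:diff]))
      start = diff
    yield batch

batches = generator(np.arange(10), 4)
print(next(batches))  # [0 1 2 3]
print(next(batches))  # [4 5 6 7]
print(next(batches))  # [8 9 0 1]  <- wraps around past the end
```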
To use it, simply write

from observations import cifar10

(x_train, y_train), (x_test, y_test) = cifar10("~/data")
x_train_data = generator(x_train, 256)

for batch in x_train_data:
  ...  # operate on batch

batch = next(x_train_data)  # alternatively, increment the iterator
There's also an extended version. It takes a list of arrays as input and yields a list of batches.
def generator(arrays, batch_size):
  """Generate batches, one with respect to each array's first axis."""
  starts = [0] * len(arrays)  # pointers to where we are in iteration
  while True:
    batches = []
    for i, array in enumerate(arrays):
      start = starts[i]
      stop = start + batch_size
      diff = stop - array.shape[0]
      if diff <= 0:
        batch = array[start:stop]
        starts[i] += batch_size
      else:
        batch = np.concatenate((array[start:], array[:diff]))
        starts[i] = diff
      batches.append(batch)
    yield batches
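Because the pointers advance in lockstep, batches from equal-length arrays stay aligned (e.g., features with their labels). A standalone check on toy data (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

def generator(arrays, batch_size):
  """Generate batches, one with respect to each array's first axis."""
  starts = [0] * len(arrays)  # pointers to where we are in iteration
  while True:
    batches = []
    for i, array in enumerate(arrays):
      start = starts[i]
      stop = start + batch_size
      diff = stop - array.shape[0]
      if diff <= 0:
        batch = array[start:stop]
        starts[i] += batch_size
      else:
        batch = np.concatenate((array[start:], array[:diff]))
        starts[i] = diff
      batches.append(batch)
    yield batches

x = np.arange(6)        # "features"
y = np.arange(6) * 10   # "labels", aligned with x
g = generator([x, y], 4)
x_batch, y_batch = next(g)  # x_batch: [0 1 2 3], y_batch: [0 10 20 30]
x_batch, y_batch = next(g)  # both wrap around in step: [4 5 0 1], [40 50 0 10]
```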
To use it, simply write

from observations import cifar10

(x_train, y_train), (x_test, y_test) = cifar10("~/data")
train_data = generator([x_train, y_train], 256)

for x_batch, y_batch in train_data:
  ...  # operate on batch

x_batch, y_batch = next(train_data)  # alternatively, increment the iterator
We'd like your help! Any pull requests which help maintain the existing functions and/or add new ones are appreciated. We follow Edward's standards for style and documentation.
Each function takes as input a filepath and optional preprocessing arguments. All necessary packages that aren't from the Python Standard Library, NumPy, or six are imported inside the function's body. The functions proceed as follows: