Sage-Bionetworks/DREAM_DM_starter_code

Name: DREAM_DM_starter_code

Owner: Sage Bionetworks

Description: Some starter code for training/testing some basic CNN models given our data.

Forked from: yidarvin/DREAM_DM_starter_code

Created: 2016-09-08 20:17:27.0

Updated: 2016-09-08 20:17:28.0

Pushed: 2016-09-02 22:29:59.0

Homepage: null

Size: 23

Language: Python


README

DREAM_DM_starter_code

This is some starter code for the DREAM Digital Mammography challenge. This directory gives examples of how to Dockerize your code, how to implement a working data I/O pipeline, and how to do some basic deep learning with the challenge data. The Python code uses the TensorFlow 0.9.0 framework and provides four popular CNN architectures.

Dependencies

You must have Docker or nvidia-docker installed and running on your system to be able to build the Docker image.

Docker

As shown on the submission wiki page for the DREAM challenge, you can build and push the docker image with the following three lines:

docker build -t docker.synapse.org/syn123456/dm-trivial-model .
docker login -u <synapse username> -p <synapse password> docker.synapse.org
docker push docker.synapse.org/syn123456/dm-trivial-model

where “syn123456” is the ID of the Synapse project where your team collaborates on this challenge. You can find more information on the wiki page.

Usage

The Python program takes care of training and testing a model (saving the model when necessary). We have strived to make the program extremely modular, so from the flags alone you should be able to capture most of the hyperparameter variation that a vanilla convolutional network can use. These are the flags:

Note: The program takes both the time and epoch parameters into account and stops at the more conservative upper bound, i.e. whichever limit is reached first.
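
For illustration, the sketch below shows one way such a flag interface could be declared with Python's argparse. The flag names are taken from the example commands later in this README (--net, --ms, --lr, --decay, --bs, --epoch, --dropout, --test); the defaults, the --time flag, and the use of argparse itself are assumptions about how the script could be organized, not a copy of DREAM_DM_pilot_tf.py.

import argparse

def parse_flags():
    # Flag names follow the example invocations in this README; defaults and
    # help text are illustrative assumptions, not the script's actual values.
    p = argparse.ArgumentParser(description="DREAM DM starter model (sketch)")
    p.add_argument("--net", type=str, default="Le",
                   help="CNN architecture to build, e.g. Le, Alex, VGG16")
    p.add_argument("--ms", type=int, default=32,
                   help="matrix size: side length of the square input image")
    p.add_argument("--lr", type=float, default=0.0001,
                   help="initial learning rate")
    p.add_argument("--decay", type=float, default=0.985,
                   help="learning-rate decay factor")
    p.add_argument("--bs", type=int, default=10,
                   help="batch size")
    p.add_argument("--epoch", type=int, default=2,
                   help="maximum number of training epochs")
    p.add_argument("--dropout", type=float, default=0.5,
                   help="dropout parameter for the dense layers")
    p.add_argument("--time", type=float, default=None,
                   help="optional wall-clock budget in hours (hypothetical flag)")
    p.add_argument("--test", type=int, default=0,
                   help="set to 1 to restore the saved model and run testing")
    return p.parse_args()

With flags like these, the note above about the more conservative upper bound simply means the training loop exits as soon as either the epoch limit or the elapsed-time limit is hit, whichever comes first.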

To submit to the challenge, you need to submit a training script and a testing script. With our code, each can be done in a single line. For example, the training line is

python DREAM_DM_pilot_tf.py --net Le --ms 32 --lr 0.0001 --decay 0.985 --bs 10 --epoch 2 --dropout 0.5

and the testing line is

python DREAM_DM_pilot_tf.py --net Le --ms 32 --test 1

Note: As the test script will depend on the model that is saved by the training script, you MUST define the same network and matrix size between both the training and testing scripts for this pipeline to work.
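
To see why this consistency matters, consider how a TensorFlow checkpoint is typically restored: the test run has to rebuild a graph whose variable shapes match the ones that were saved. The sketch below is a hedged illustration of that pattern using TensorFlow's standard Saver; build_network, the single-channel input, and the checkpoint path are hypothetical names and assumptions, not the starter script's actual API.

import tensorflow as tf

def restore_for_testing(build_network, matrix_size, checkpoint_path):
    # Rebuild the graph with the SAME architecture and input size used for
    # training; otherwise the saved variable shapes will not match and
    # restore() will fail.
    X = tf.placeholder(tf.float32, [None, matrix_size, matrix_size, 1])
    predictions = build_network(X)
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, checkpoint_path)
    return sess, X, predictions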

Advanced Usage

The module general_conv lets you define your own stack of convolution layers. Since every vanilla convolutional neural network can be described by a small set of numbers, this program takes in those numbers and gives you the corresponding network. You feed your customized CNN architecture to general_conv through the list-of-lists variable architecture_conv. Each entry is either a list of three numbers (filter size, filter count, and stride) for a convolution layer or a list of two numbers (0 followed by the pooling size) for a max-pooling layer. For example, below, we can use these modules to quickly define and call AlexNet or VGG16.

# AlexNet with an input tensor X and 2 outputs.
architecture_Alex = [[11, 96, 4], [0, 2],
                     [11, 256, 1], [0, 2],
                     [3, 384, 1], [3, 384, 1], [3, 256, 1], [0, 2]]
layer = general_conv(X, architecture_Alex)
layer = dense(layer, 4096)
layer = dense(layer, 4096)
pred_Alex = output(layer, 2)

# VGG16 Net with an input tensor X and 2 outputs.
architecture_VGG16 = [[3, 64, 1], [3, 64, 1], [0, 2],
                      [3, 128, 1], [3, 128, 1], [0, 2],
                      [3, 256, 1], [3, 256, 1], [3, 256, 1], [0, 2],
                      [3, 512, 1], [3, 512, 1], [3, 512, 1], [0, 2],
                      [3, 512, 1], [3, 512, 1], [3, 512, 1], [0, 2]]
layer = general_conv(X, architecture_VGG16)
layer = dense(layer, 4096)
layer = dense(layer, 4096)
pred_VGG16 = output(layer, 2)
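
To make the convention above concrete, general_conv could be implemented roughly as in the sketch below: it walks the architecture list and emits a ReLU convolution for each three-number entry and a max pool for each [0, k] entry. The weight initialization, 'SAME' padding, and overall structure are assumptions for illustration, not necessarily what DREAM_DM_pilot_tf.py actually does.

import tensorflow as tf

def general_conv(X, architecture_conv):
    # X: input tensor of shape [batch, height, width, channels].
    # architecture_conv: list of [filter_size, filter_count, stride] entries
    #                    for convolution layers, or [0, pool_size] entries
    #                    for max-pooling layers.
    layer = X
    for spec in architecture_conv:
        if spec[0] == 0:
            # [0, k]: k x k max pooling with stride k.
            k = spec[1]
            layer = tf.nn.max_pool(layer, ksize=[1, k, k, 1],
                                   strides=[1, k, k, 1], padding='SAME')
        else:
            # [f, n, s]: f x f convolution with n filters and stride s.
            f, n, s = spec
            in_channels = layer.get_shape().as_list()[-1]
            W = tf.Variable(tf.truncated_normal([f, f, in_channels, n], stddev=0.01))
            b = tf.Variable(tf.zeros([n]))
            layer = tf.nn.conv2d(layer, W, strides=[1, s, s, 1], padding='SAME')
            layer = tf.nn.relu(layer + b)
    return layer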

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.