biolab/Multicore-TSNE

Name: Multicore-TSNE

Owner: Bioinformatics Laboratory

Description: Parallel t-SNE implementation with Python and Torch wrappers.

Forked from: DmitryUlyanov/Multicore-TSNE

Created: 2017-10-13 12:08:38.0

Updated: 2017-10-13 12:08:40.0

Pushed: 2017-11-09 14:01:55.0

Homepage:

Size: 391

Language: C++

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

Multicore t-SNE

This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with python and Torch CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core.

What to expect

Barnes-Hut t-SNE is done in two steps.

So when can you benefit from parallelization? It is almost true, that the second step computation time is constant of D and depends mostly on N. The first part's time depends on D a lot, so for small D time(Step 1) << time(Step 2), for large D time(Step 1) >> time(Step 2). As we are only good at parallelizing step 1 we will benefit most when D is large enough (MNIST's D = 784 is large, D = 10 even for N=1000000 is not so much). I wrote multicore modification originally for Springleaf competition, where my data table was about 300000 x 3000 and only several days left till the end of the competition so any speed-up was handy.

Benchmark

1 core

Interestingly, that this code beats other implementations. We compare to sklearn (Barnes-Hut of course), L. Van der Maaten's bhtsne, py_bh_tsne repo (cython wrapper for bhtsne with QuadTree). perplexity = 30, theta=0.5 for every run. In fact py_bh_tsne repo works at the same speed as this code when using more optimization flags for compiler.

This is a benchmark for 70000x784 MNIST data:

| Method | Step 1 (sec) | Step 2 (sec) | | —————————- |:—————:| ————–:| | MulticoreTSNE(n_jobs=1) | 912 | 350 | | bhtsne | 4257 | 1233 | | py_bh_tsne | 1232 | 367 | | sklearn(0.18) | ~5400 | ~20920 |

I did my best to find what is wrong with sklearn numbers, but it is the best benchmark I could do (you can find test script in python/tests folder).

Multicore

This table shows a relative to 1 core speed-up when using n cores.

| n_jobs | Step 1 | Step 2 | | ————- |:———:| ——–:| | 1 | 1x | 1x | | 2 | 1.54x | 1.05x | | 4 | 2.6x | 1.2x | | 8 | 5.6x | 1.65x |

How to use

Python and torch wrappers are available.

Python
Install

Make sure cmake is installed on your system.

To install the package please do:

clone https://github.com/DmitryUlyanov/Multicore-TSNE.git
ulticore-TSNE/
install --no-cache-dir .

It's important that you add --no-cache-dir otherwise pip won't copy the .so file which is needed at runtime.

For installation on MacOS, before running pip install --no-cache-dir ., you need to

Tested with both Python 2.7 and 3.6 (conda) and Ubuntu 14.04.

Run

You can use it as a drop-in replacement for sklearn.manifold.TSNE.

 MulticoreTSNE import MulticoreTSNE as TSNE

 = TSNE(n_jobs=4)
tsne.fit_transform(X)

Please refer to sklearn TSNE manual for parameters explanation.

Only double arrays are supported for now. For this implementation n_components is fixed to 2, which is the most common case (use Barnes-Hut t-SNE or sklearn otherwise). Also note that some of the parameters will be ignored for sklearn compatibility. Only these parameters are used (and they are the most important ones):

Test

You can test it on MNIST dataset with the following command:

on MulticoreTSNE/examples/test.py <n_jobs>
Note on jupyter use

To make the computation log visible in jupyter please install wurlitzer (pip install wurlitzer) and execute this line in any cell beforehand:

d_ext wurlitzer

Memory leakages are possible if you interrupt the process. Should be OK if you let it run until the end.

Torch

To install execute the following command from repository folder:

ocks make torch/tsne-1.0-0.rockspec

or

ocks install https://raw.githubusercontent.com/DmitryUlyanov/Multicore-TSNE/master/torch/tsne-1.0-0.rockspec

You can run t-SNE like that:

 = require 'tsne'

tsne(X, n_components, perplexity, n_iter, angle, n_jobs)

torch.DoubleTensor type only supported for now.

License

Inherited from original repo's license.

Future work

Citation

Please cite this repository if it was useful for your research:

c{Ulyanov2016,
thor = {Ulyanov, Dmitry},
tle = {Muticore-TSNE},
ar = {2016},
blisher = {GitHub},
urnal = {GitHub repository},
wpublished = {\url{https://github.com/DmitryUlyanov/Muticore-TSNE}},

Of course, do not forget to cite L. Van der Maaten's paper


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.