thedataincubator/pydata2016

Name: pydata2016

Owner: The Data Incubator

Description: A couple projects using scikit-learn illustrating project decision making.

Created: 2016-10-08 13:36:13.0

Updated: 2017-10-16 17:31:09.0

Pushed: 2016-10-08 14:17:28.0

Homepage:

Size: 2871

Language: Jupyter Notebook

README

Practical Machine Learning

We offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production (data cleaning, feature engineering, model building and evaluation, and deployment) and diving into two applications: anomaly detection and a personalized recommendation engine. All concepts will be presented with example code in Python.

Installation

All of the code we present uses Python 2.7. A number of libraries beyond the standard library are used. We recommend using the conda package manager to install the same versions that we are using, in a manner that won't interfere with your system packages.

Conda

You may install either the full Anaconda distribution or the smaller Miniconda system. The former provides over 720 packages, ready to use; the latter makes it easy to download them when needed. Once either is installed, you can install the packages we will be using into a separate environment with

conda env create -f environment.yml

This will create a new conda environment named pydata. It can be activated on Linux and OS X with

source activate pydata

or on Windows with

activate pydata

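Once the environment is active, a quick import check can confirm that the installation worked. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module list is an assumption that should be adjusted to match the packages named in environment.yml.

```python
from __future__ import print_function

import importlib
import sys


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list: adjust it to match the packages named in environment.yml.
    expected = ["numpy", "pandas", "sklearn"]
    print("Python version:", sys.version.split()[0])
    absent = missing_modules(expected)
    if absent:
        print("Missing modules:", ", ".join(absent))
    else:
        print("All expected modules imported successfully.")
```

Run it with the pydata environment activated; any module reported as missing suggests that the environment was not created or activated correctly.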
Data

All of the material will use real-world data sets. We recommend that you download them to your personal machine before the day of the workshop. Two applications will be presented.

Recommendation Engine

We will be using the MovieLens 10M data set, assembled by the University of Minnesota. The data come as a single 63 MB zip file, available at http://files.grouplens.org/datasets/movielens/ml-10m.zip.
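As a rough illustration of how the ratings might be read once the archive is unpacked, the sketch below parses records in the `UserID::MovieID::Rating::Timestamp` layout described in the data set's own README. The function names are hypothetical, the sample records are made up for demonstration, and this is not necessarily how the workshop notebooks load the data.

```python
def parse_rating_line(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' record into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)


def read_ratings(lines):
    """Parse an iterable of rating lines, skipping blank ones."""
    return [parse_rating_line(line) for line in lines if line.strip()]


# Made-up records in the documented format, used only to illustrate the parser.
sample = ["1::10::4.5::1000000000\n", "2::20::3::1000000001\n"]
ratings = read_ratings(sample)
```

With the real data, an open file handle for the extracted ratings file can be passed to `read_ratings` in place of `sample`.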

Anomaly Detection

We will be using data from the New York CitiBike program. The data are available as a number of zip files at https://s3.amazonaws.com/tripdata/index.html. These can be downloaded with the provided script:

download.sh
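For readers who cannot run a shell script (for example, on Windows), the same idea can be sketched in Python. This is not the contents of download.sh, which are not shown here; the helpers `local_path_for` and `download_if_missing` are hypothetical names, and the URLs to fetch must be supplied by you, since the individual CitiBike file names are not listed in this document.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2.7


def local_path_for(url, dest_dir):
    """Return the path in dest_dir where the file at url would be stored."""
    filename = url.rstrip("/").split("/")[-1]
    return os.path.join(dest_dir, filename)


def download_if_missing(url, dest_dir):
    """Download url into dest_dir unless the file already exists there."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    target = local_path_for(url, dest_dir)
    if not os.path.exists(target):
        urlretrieve(url, target)
    return target
```

For example, `download_if_missing("http://files.grouplens.org/datasets/movielens/ml-10m.zip", "data")` would store the MovieLens archive as `data/ml-10m.zip` and skip the download on later runs.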

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.