CD2H gitForager

uwescience/DSSG2016-UnsafeFoods

Name: DSSG2016-UnsafeFoods

Owner: UW eScience Institute

Description: Project's website:

Created: 2016-06-21 17:00:10.0

Updated: 2017-10-01 17:28:40.0

Pushed: 2016-11-07 21:16:35.0

Homepage: https://uwescience.github.io/DSSG2016-UnsafeFoods/

Size: 19763

Language: Jupyter Notebook

README

Mining Online Data for Early Identification of Unsafe Food Products

Materials for the Mining Online Data for Early Identification of Unsafe Food Products project from the UW eScience Institute's Data Science for Social Good program.

Files

asins/ folder contains the ASINs (Amazon Standard Identification Numbers) corresponding to the UPCs in the upcs/ folder, plus asin_intersection.txt
asins/asin-intersection.txt: List of ASINs that appear in both the FDA recall dataset and the Amazon Grocery and Gourmet Food review dataset
asins/asin-intersection_health.txt: List of ASINs that appear in both the FDA recall dataset and the Amazon health care review dataset
code/data-preprocessing.py: Functions for data processing
code/amz-reviews-to-strict-json.py: Code to convert the raw Amazon review file to strict JSON
code/amz-reviews-lda.R: Code to conduct LDA topic modeling and create interactive visualizations in R
code/enforcement-data-merge.R: Code to extract data on food products from the weekly FDA enforcement reports and generate one large CSV from the 200+ weekly CSVs.
notebooks/Fetching ASINs (FINALLY).ipynb: Code to gather all ASINs for a file of UPCs
notebooks/NLTK Workbook.ipynb: Notebook to create a corpus from the Amazon review data
notebooks/NMF_exploration.ipynb: IPython notebook that uses NMF (non-negative matrix factorization) to obtain topic results for a subset of the Amazon review data
notebooks/join_review-recall notebook.ipynb: IPython notebook that constructs a dataframe of Amazon reviews, product metadata, and recall status from reviews_Grocery_and_Gourmet_Food.json.gz, meta_Grocery_and_Gourmet_Food.json.gz, and asin_intersection.txt
upcs/ folder contains all of the UPCs from the FDA recalls, split into four files
github_data/ folder contains small data files to be stored on GitHub (rather than in the ignored data/ folder)
github_data/amazon_product_categories.csv: CSV file storing the Amazon product category hierarchy. The first column contains each category name, and the second column contains its “parent” category.
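
The two-column layout of amazon_product_categories.csv (category, parent) can be walked upward to recover a category's full path. The following is a minimal sketch rather than code from the repository. It assumes the file has no header row and that a top-level category has an empty parent field; neither is confirmed here. The helper names load_parents and category_path and the sample rows are hypothetical.

```python
import csv
import io


def load_parents(csv_file):
    """Map each category name to its parent name (hypothetical helper)."""
    parents = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        category, parent = row[0].strip(), row[1].strip()
        # Assumption: an empty parent field marks a top-level category.
        parents[category] = parent or None
    return parents


def category_path(parents, category):
    """Return the categories from the top level down to `category`."""
    path = []
    seen = set()
    while category is not None and category not in seen:
        seen.add(category)  # guard against cycles in the data
        path.append(category)
        category = parents.get(category)
    return list(reversed(path))


# Illustrative rows only; the real file's contents may differ.
sample = io.StringIO(
    "Grocery & Gourmet Food,\n"
    "Beverages,Grocery & Gourmet Food\n"
    "Coffee,Beverages\n"
)
parents = load_parents(sample)
print(" > ".join(category_path(parents, "Coffee")))
```

To read the real file, pass an open file handle (opened with newline="") to load_parents instead of the in-memory sample.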

Data

The contents of data/ are ignored by git, but this is what it should contain:

data/raw/reviews_Grocery_and_Gourmet_Food.json.gz is from http://jmcauley.ucsd.edu/data/amazon/links.html – scroll down to “Per-category files” and select the Grocery and Gourmet Food reviews file. Note this is NOT the 5-core reviews file that appears under the “Files” header on the web page. This data file should have 1,297,156 reviews. This file is not strict JSON, but can be converted to strict JSON with amz-reviews-to-strict-json.py, which will output a file to the data/processed/ folder.
data/raw/meta_Grocery_and_Gourmet_Food.json.gz is from http://jmcauley.ucsd.edu/data/amazon/links.html – scroll down to “Per-category files” and select the Grocery and Gourmet Food metadata file. This data file should have 171,760 products.
data/raw/FDA_recalls.xml – this is the FDA recall data in XML form. In theory, this data should be available from data.gov at https://catalog.data.gov/dataset/all-fda-recalls-1ae7b; however, the link on that page is broken. We used the Wayback Machine to access a previous version of this data here: https://web.archive.org/web/20150504011324/http://www.fda.gov/DataSets/Recalls/RecallsDataSet.xml. In R, this can be converted to a CSV with the following code (which assumes that the XML data is saved as a file called FDA_recalls.xml):

# Install the XML package if it is not already installed
if (!"XML" %in% installed.packages()) {
    install.packages("XML")
}

# Load the XML package
library("XML")

# Parse XML
doc <- xmlTreeParse("FDA_recalls.xml", useInternalNodes = TRUE)

# Convert to data frame
dat <- xmlToDataFrame(doc)

# Write to CSV
write.csv(dat, "../processed/FDA_recalls.csv", row.names = FALSE)

data/raw/FDA-enforcement/ is a folder that contains weekly FDA enforcement reports downloaded manually from http://www.accessdata.fda.gov/scripts/ires/index.cfm. This data goes back to mid-2012.
data/processed/FDA_food_enforcements_2012-06_to_2016-07.csv has data from the weekly enforcement reports for food products as one large file.
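
The conversion of the raw review file to strict JSON can be sketched as follows. This is not the repository's amz-reviews-to-strict-json.py, whose contents are not reproduced here. It is a minimal illustration that assumes each line of the raw file is a Python-style dict literal (for example, with single-quoted strings). The function names and the sample record are hypothetical.

```python
import ast
import gzip
import json


def to_strict_json_line(raw_line):
    """Convert one Python-literal record to a strict JSON string."""
    # literal_eval only accepts literals, so it is safer than eval().
    record = ast.literal_eval(raw_line.strip())
    return json.dumps(record)


def convert_file(in_path, out_path):
    """Write one strict JSON object per line; return the record count."""
    count = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(to_strict_json_line(line) + "\n")
                count += 1
    return count


# Illustrative record only, not taken from the dataset.
example = "{'asin': 'B000000000', 'overall': 5.0, 'reviewText': 'It\\'s fine'}"
print(to_strict_json_line(example))
```

The count returned by convert_file gives a quick check against the 1,297,156 reviews the raw file is expected to contain.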

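Combining the weekly enforcement reports into one food-only CSV might look like the sketch below. The repository does this in R (code/enforcement-data-merge.R); this Python version is only an illustration under stated assumptions. It assumes that every weekly CSV shares the same header and is UTF-8 encoded. The column name "Product Type" and the value "Food" are guesses at how food rows are marked, and merge_weekly_reports is a hypothetical name.

```python
import csv
import glob
import os
import tempfile


def merge_weekly_reports(folder, out_path, type_column="Product Type",
                         keep_value="Food"):
    """Concatenate weekly CSVs, keeping rows whose type column matches."""
    # List inputs before creating the output so it is never read back in.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    writer = None
    kept = 0
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # empty file, nothing to merge
                if writer is None:
                    # Use the first file's header for the combined output.
                    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                            extrasaction="ignore")
                    writer.writeheader()
                for row in reader:
                    if row.get(type_column) == keep_value:
                        writer.writerow(row)
                        kept += 1
    return kept


# Tiny demonstration with made-up rows in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    weeks = {
        "week1.csv": "Product Type,Product Description\nFood,Granola\nDrugs,Tablets\n",
        "week2.csv": "Product Type,Product Description\nFood,Peanut butter\n",
    }
    for name, body in weeks.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    merged_path = os.path.join(tmp, "merged.out")
    kept = merge_weekly_reports(tmp, merged_path)
    with open(merged_path, encoding="utf-8") as f:
        merged_lines = f.read().splitlines()
    print(kept, "food rows kept")
```

Taking the header from the first file keeps the output columns stable; fields missing from later files are written as empty strings and extra fields are dropped.
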
This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.