sunlightlabs/mlscrape

Name: mlscrape

Owner: Sunlight Labs

Description: mlscrape is a library for site-specific automated website scraping based on human-annotated examples

Created: 2015-05-08 22:14:18.0

Updated: 2015-08-21 09:55:12.0

Pushed: 2015-05-19 17:36:37.0

Homepage:

Size: 124

Language: Python

README

mlscrape

mlscrape is a library for site-specific automated website scraping based on human-annotated examples. It contains two types of models: one that learns to distinguish pages of interest from uninteresting pages, and one that identifies elements of interest within target pages. It works best on websites whose DOM contains good clues for distinguishing both interesting pages from uninteresting ones and interesting DOM nodes from uninteresting ones.
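The two-model idea can be illustrated with a minimal, standard-library-only sketch. This is not mlscrape's API: every name below (NodeFeatureExtractor, FeatureNB, parse, page_features) is hypothetical, the sample HTML is made up, and the small naive-Bayes-style scorer is an assumption chosen for brevity, since the README does not say which learning algorithm the library uses. The sketch trains one model on whole-page DOM features and a second model on per-node DOM features.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class NodeFeatureExtractor(HTMLParser):
    """Hypothetical helper: one feature set per element (tag, parent, id, classes)."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements
        self.nodes = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1]["tag"] if self.stack else "ROOT"
        feats = {"tag=" + tag, "parent=" + parent}
        if attrs.get("id"):
            feats.add("id=" + attrs["id"])
        for cls in (attrs.get("class") or "").split():
            feats.add("class=" + cls)
        node = {"tag": tag, "features": feats, "text": ""}
        self.nodes.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data


class FeatureNB:
    """Assumed stand-in model: smoothed log-odds over the features that are present."""

    def fit(self, examples):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}
        for feats, label in examples:
            self.totals[label] += 1
            self.counts[label].update(feats)
        return self

    def score(self, feats):
        score = math.log((self.totals[True] + 1) / (self.totals[False] + 1))
        for f in feats:
            if f not in self.counts[True] and f not in self.counts[False]:
                continue  # feature never seen in training
            p_pos = (self.counts[True][f] + 1) / (self.totals[True] + 2)
            p_neg = (self.counts[False][f] + 1) / (self.totals[False] + 2)
            score += math.log(p_pos / p_neg)
        return score

    def predict(self, feats):
        return self.score(feats) > 0


def parse(html):
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


def page_features(nodes):
    # A page is described by the union of its nodes' features.
    return set().union(*(n["features"] for n in nodes))


# Made-up annotated examples: one interesting page, one uninteresting page.
ARTICLE = ('<html><body><div class="article"><h1 class="title">Budget vote</h1>'
           '<p class="body">Text</p></div></body></html>')
INDEX = '<html><body><ul class="nav"><li><a href="/a">A</a></li></ul></body></html>'
NEW_PAGE = ('<html><body><div class="article">'
            '<h1 class="title">Road repairs</h1></div></body></html>')

# Model 1: is this a page of interest?
page_model = FeatureNB().fit([
    (page_features(parse(ARTICLE)), True),
    (page_features(parse(INDEX)), False),
])

# Model 2: which nodes are of interest? Here the annotation marks title nodes.
node_model = FeatureNB().fit([
    (n["features"], "class=title" in n["features"]) for n in parse(ARTICLE)
])

new_nodes = parse(NEW_PAGE)
is_article = page_model.predict(page_features(new_nodes))
titles = [n["text"] for n in new_nodes if node_model.predict(n["features"])]
print(is_article, titles)
```

In this sketch the page-level model decides whether a fetched page is worth extracting from, and the node-level model is applied only to the nodes of such pages. Both rely on DOM clues such as tag names, class names, and parent tags, which mirrors the README's remark that the approach works best when the DOM carries distinguishing signals.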


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.