Name: hanuman
Owner: Sunlight Labs
Description: This project contains tools for doing machine-learning-driven scraping of the websites of lobbying firms.
Created: 2015-02-13 20:28:04.0
Updated: 2015-11-05 05:36:40.0
Pushed: 2015-07-08 21:59:59.0
Size: 572
Language: Python
The Invisible Influencers project focuses on finding information about people who fit the common-sense but not statutory definition of “lobbyist,” mainly by extracting information from the text of the staff biographies of such people posted on the websites of their employers. This repository houses various bits related to the retrieval of lobbyist bios:
chrome_app
contains a Chrome app that presents users with an interface for annotating lobbying firm websites to indicate which pages are bio pages, and which parts of the page are the person's name and bio text.
data_collection
contains the Django app with which the Chrome app communicates.
extraction
contains another Django app that generalizes based on the user input collected with the Chrome app, using machine learning to find more bio pages on a given firm's site based on a hand-collected sample, and to extract the content (names, bios) from those pages. This component depends on the nanospider repo for spidering, and the mlscrape repo for building machine learning models to recognize pages and relevant content.
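The idea of generalizing from a hand-collected sample to new pages can be illustrated with a small sketch. This is not the mlscrape API or the repository's actual model: it is a standard-library stand-in that uses a naive Bayes word model, and the names `LabeledPage`, `BioPageClassifier`, and `tokenize`, as well as the example URLs and page text, are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledPage:
    """One hand-annotated page (hypothetical record, not the repo's schema)."""
    url: str
    text: str
    is_bio: bool


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class BioPageClassifier:
    """Multinomial naive Bayes with add-one smoothing over page words."""

    def fit(self, pages):
        self.word_counts = {True: Counter(), False: Counter()}
        self.page_counts = Counter()
        for page in pages:
            self.page_counts[page.is_bio] += 1
            self.word_counts[page.is_bio].update(tokenize(page.text))
        if not (self.page_counts[True] and self.page_counts[False]):
            raise ValueError("sample needs both bio and non-bio pages")
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, label, tokens):
        # Log prior for the class plus smoothed log likelihood of each known word.
        total_pages = sum(self.page_counts.values())
        score = math.log(self.page_counts[label] / total_pages)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        """Return True if the text looks more like a bio page than not."""
        tokens = tokenize(text)
        return self._log_score(True, tokens) > self._log_score(False, tokens)


# Hypothetical hand-collected sample for one firm's site.
sample = [
    LabeledPage("https://example.com/team/jane",
                "Jane Doe is a partner who joined the firm after serving as chief of staff", True),
    LabeledPage("https://example.com/team/john",
                "John Roe is a senior advisor and previously served as counsel to the committee", True),
    LabeledPage("https://example.com/contact",
                "Contact us at our office by phone or email", False),
    LabeledPage("https://example.com/news",
                "News and press releases about our services and clients", False),
]

clf = BioPageClassifier().fit(sample)
looks_like_bio = clf.predict("Mary Major is a partner and previously served as counsel")
looks_like_contact = clf.predict("Contact our office by email")
```

In this sketch the annotated pages play the role of the user input collected with the Chrome app, and `predict` plays the role of deciding whether a newly spidered page is a bio page. Extracting the name and bio text from within a page is a separate step that is not shown here.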