CD2H gitForager

TransparencyToolkit/CrawlerManager

Name: CrawlerManager

Owner: Transparency Toolkit

Description: API for calling crawlers

Created: 2015-11-20 21:24:22.0

Updated: 2017-10-23 01:18:27.0

Pushed: 2017-05-22 14:58:42.0

Size: 85

Language: Ruby

README

CrawlerManager

API for running and managing crawlers and parsing results

CrawlerManager can be used together with the Harvester web interface to run queries and load results.

Installing

Make sure you have the proper system dependencies. On Debian, do the following:
sudo apt-get install sqlite3 libsqlite3-dev

Get the code:
git clone https://github.com/TransparencyToolkit/CrawlerManager

Install Ruby dependencies:
bundle install

Setup

Create the databases:
rake db:create:all

Reset existing databases:
rake db:reset

WARNING

Currently, for Harvester to save data, the directories /home/user/Data/KG/ and /home/user/Data/KG/All_Pics/ must both exist. This is kludgy and will be configurable soon!
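
Until the path is configurable, the two directories can be created ahead of time. The following is a minimal Ruby sketch, assuming the current user may write to the base directory. The helper name ensure_data_dirs and its base argument are illustrative and are not part of CrawlerManager.

```ruby
require "fileutils"

# Hypothetical helper (not part of CrawlerManager): creates the data
# directories that the warning above says must exist.
# `base` defaults to the path named in this README.
def ensure_data_dirs(base = "/home/user/Data/KG")
  pics = File.join(base, "All_Pics")
  # mkdir_p creates any missing parent directories and does nothing
  # if the directory is already present, so repeated calls are safe.
  FileUtils.mkdir_p(pics)
  [base, pics]
end
```

Calling ensure_data_dirs with no argument targets the path from the warning; passing another base directory is handy for trying the helper out without touching /home/user.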

Running CrawlerManager

Run the app by typing rails server -p 9506
Run a test crawl on public LinkedIn data for the term “xkeyscore”
Get details about a specific crawler (e.g. Google)
List all available crawlers

Additional Configuration

To use proxies, set environment variable PROXYLIST to the path to the proxylist you want to use.

To solve CAPTCHAs, set environment variable SOLVERDETAILS to your 2Captcha key.
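
Because both settings come from the environment, a quick pre-flight check can catch a missing proxy list file or an empty key before the server starts. The sketch below is only an illustration: the variable names PROXYLIST and SOLVERDETAILS come from this README, while the function check_crawler_env and its messages are hypothetical, and nothing here reflects how CrawlerManager itself reads these values.

```ruby
# Hypothetical pre-flight check (not part of CrawlerManager).
# Returns a list of human-readable problems. An empty list means each
# optional setting is either unset or looks usable.
def check_crawler_env(env = ENV)
  problems = []

  # PROXYLIST should be a path to an existing file when it is set.
  proxylist = env["PROXYLIST"]
  if proxylist && !File.file?(proxylist)
    problems << "PROXYLIST points to a missing file: #{proxylist}"
  end

  # SOLVERDETAILS should hold a non-blank key when it is set.
  solver = env["SOLVERDETAILS"]
  if solver && solver.strip.empty?
    problems << "SOLVERDETAILS is set but empty"
  end

  problems
end
```

With no argument the function inspects the real environment; passing a plain Hash instead makes it easy to experiment without exporting anything.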

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.