TransparencyToolkit/CrawlerManager

Name: CrawlerManager

Owner: Transparency Toolkit

Description: API for calling crawlers

Created: 2015-11-20 21:24:22.0

Updated: 2017-10-23 01:18:27.0

Pushed: 2017-05-22 14:58:42.0

Size: 85

Language: Ruby

README

CrawlerManager

API for running and managing crawlers and parsing results

CrawlerManager can be used in combination with the Harvester web interface to run queries and load results.

Installing

Make sure you have the proper system dependencies with

Setup

WARNING

Currently, for Harvester to save data, the paths /home/user/Data/KG/ and /home/user/Data/KG/All_Pics/ must both exist. This is kludgy and will be configurable soon!
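
One way to create those directories ahead of time is sketched below. The helper name ensure_data_dirs is hypothetical and is not part of CrawlerManager or Harvester; only the two paths come from the warning above.

```ruby
require "fileutils"

# Paths taken from the warning above; Harvester currently expects both to exist.
DEFAULT_DATA_DIRS = [
  "/home/user/Data/KG/",
  "/home/user/Data/KG/All_Pics/"
].freeze

# Hypothetical helper (not part of CrawlerManager): create any missing
# directories, then report whether every one of them now exists.
def ensure_data_dirs(dirs = DEFAULT_DATA_DIRS)
  # mkdir_p creates intermediate directories and is a no-op if they exist.
  dirs.each { |dir| FileUtils.mkdir_p(dir) }
  dirs.all? { |dir| File.directory?(dir) }
end
```

Calling ensure_data_dirs with no arguments would create the default paths, provided the current user has permission to write under /home/user.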

Running CrawlerManager

Additional Configuration

To use proxies, set environment variable PROXYLIST to the path to the proxylist you want to use.

To solve CAPTCHAs, set environment variable SOLVERDETAILS to your 2Captcha key.
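
As a rough illustration of how these two variables might be read and checked before starting a crawl, consider the sketch below. The helper crawler_settings and its return format are assumptions made for this example; they do not describe how CrawlerManager itself consumes the variables. Only the names PROXYLIST and SOLVERDETAILS come from the text above.

```ruby
# Hypothetical helper (not part of CrawlerManager): collect the two optional
# settings from an environment-like object, defaulting to the process ENV.
def crawler_settings(env = ENV)
  proxy_path = env["PROXYLIST"]     # path to a proxy list file, or nil if unset
  solver_key = env["SOLVERDETAILS"] # 2Captcha key, or nil if unset

  # Fail early if a proxy list was requested but the file is not there.
  if proxy_path && !File.file?(proxy_path)
    raise ArgumentError, "PROXYLIST points to a missing file: #{proxy_path}"
  end

  { proxy_list: proxy_path, solver_key: solver_key }
end
```

Both values are nil when the corresponding variable is unset, so a caller can treat proxies and CAPTCHA solving as optional.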


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.