TransparencyToolkit/Harvester

Name: Harvester

Owner: Transparency Toolkit

Description: Web crawling and document processing through a usable interface.

Created: 2015-11-25 17:06:20.0

Updated: 2018-01-11 07:59:05.0

Pushed: 2017-07-22 15:55:59.0

Homepage: https://transparencytoolkit.org

Size: 60036

Language: JavaScript

README

Harvester

Harvester is a tool for crawling websites and running OCR and metadata extraction on documents, all through a usable graphical interface. The goal is to let journalists, activists, and researchers rapidly collect open-source intelligence (OSINT) from public websites and convert any set of documents into machine-readable form without programming or complex technical setup.

Harvester requires DocManager so that it can index the data with Elasticsearch. Harvester can also be used with LookingGlass to seamlessly generate searchable archives of crawled data and processed documents.

Installation

Dependencies
Setup Instructions
  1. Install the dependencies

  2. Download Elasticsearch (https://www.elastic.co/downloads/elasticsearch)

  3. Download RVM (https://rvm.io/rvm/install)

  4. Install Ruby: Run rvm install 2.4.1 and rvm use 2.4.1

  5. Install Rails: gem install rails

  6. Install Debian dependencies: sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev libmagickcore-dev libmagickwand-dev mongodb

  7. Follow the installation instructions for DocManager

  8. Install Redis: instructions for Debian

  9. Install Tika & Tesseract (optional)

NOTE: By default, document conversion (PDFs, docs, etc.) is handled by GiveMeText, which sends your documents over the clear internet. DO NOT USE THIS with sensitive documents; instead, install Tika & Tesseract as described below.
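
Once the setup steps are done, a quick preflight check can confirm that the expected command-line tools are on the PATH. The following is a minimal sketch, not part of Harvester itself: the function name missing_tools is hypothetical, and the binary names passed to it (ruby, gem, rails, redis-cli, mongod) are assumptions about what the packages in the steps above typically install on Debian.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not shipped with Harvester):
# print each given command name that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command resolves.
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed binary names for the tools mentioned in the setup steps.
missing=$(missing_tools ruby gem rails redis-cli mongod)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All checked tools were found."
fi
```

The script only reports what is missing and never installs or changes anything, so it should be safe to rerun after each step. It does not check that Elasticsearch, Redis, or MongoDB are actually running, only that the listed commands can be found.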


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.