nasa-jpl-memex/sce

Name: sce

Owner: NASA JPL MEMEX

Description: Sparkler Crawl Environment - a packaged, dockerized version of http://github.com/USCDataScience/sparkler.git

Created: 2017-06-05 21:54:07.0

Updated: 2018-02-25 22:57:03.0

Pushed: 2017-11-01 16:09:37.0

Homepage: http://irds.usc.edu/sparkler/

Size: 57

Language: Shell

README

Sparkler Crawl Environment

The Sparkler Crawl Environment aims to provide an efficient, scalable, consistent, and reliable software architecture made up of domain discovery tools. These tools enrich a given domain by expanding the collection of artifacts that define it.

This repository, named sce, provides a command-line utility for building the Sparkler Crawl Environment as a multi-container Docker application that runs under Docker Compose on a single node. As a proof of concept (PoC), you can install the Sparkler Crawl Environment on a single node with the kickstart.sh bash script, which automatically builds and starts all the software components:

./kickstart.sh [-l /path/to/log]
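Because kickstart.sh depends on Docker and Docker Compose, it can help to check for both before running it. The following is a minimal sketch of such a pre-flight check, not part of the repository: the function names missing_tools and kickstart_cmd and the log path /tmp/sce.log are hypothetical, and the only option assumed for kickstart.sh is the -l flag shown above. It also assumes the Compose tool is installed as the docker-compose binary.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around kickstart.sh (not shipped with sce).

# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print the kickstart command line; the log path argument is optional.
# Note: the path is printed unquoted, so this sketch assumes it has no spaces.
kickstart_cmd() {
  local log_path="${1:-}"
  if [ -n "$log_path" ]; then
    printf './kickstart.sh -l %s\n' "$log_path"
  else
    printf './kickstart.sh\n'
  fi
}

# Example usage (left commented out so that sourcing this file has no side effects):
#   missing=$(missing_tools docker docker-compose)
#   if [ -n "$missing" ]; then
#     echo "Please install: $missing" >&2
#   else
#     eval "$(kickstart_cmd /tmp/sce.log)"
#   fi
```

Keeping the check separate from the launch means the script fails early with a clear message instead of partway through a build.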

The Sparkler Crawl Environment is built on top of Sparkler, a web crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects, including Spark, Kafka, Lucene/Solr, Tika, and Felix. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.