Name: memex-gate
Owner: NASA JPL MEMEX
Description: General Architecture for Text Engineering
Created: 2015-05-22 22:14:18.0
Updated: 2018-02-25 23:15:49.0
Pushed: 2016-03-23 21:11:37.0
Homepage: null
Size: 242630
Language: Python
A server-side application and environment for running large-scale General Architecture for Text Engineering tasks over document resources such as online ads, debarment information, federal and district court appeals, press releases, news articles, and social media streams. MemexGATE runs in conjunction with Behemoth to provide an annotation-based implementation of document corpora, together with a number of modules that operate on these documents. The project can be used to simplify the deployment of document analysers on a large scale.
This tool relies heavily on the GATE software; GATE stands for General Architecture for Text Engineering. Please see below for all of the steps required to use the software. The document corpora I've made available can be used with the MemexGATE application to do interesting things with legal documents.
MemexGATE is available on Docker Hub for rapid deployment and prototyping of text engineering and processing pipelines. To get the MemexGATE application and environment, make sure you have Docker installed, then run:
docker pull lewismc/memex-gate
docker run -t -i lewismc/memex-gate /bin/bash
If you are on Mac OS X, you may need to run the following two commands first:
boot2docker start
$(boot2docker shellinit)
You will now be within your own environment with all of the tools required to run MemexGATE, namely Hadoop 2.2.0, Mahout 0.10.0, Tika 1.9, GATE 8.1, etc. You can run MemexGATE as follows:
root@e4e137838adc:/usr/local# memexgate
____ ________ ___________________________
\ ____ _____ ____ ___ ___/ _____/ / _ \__ ___/\_ _____/
\ / \_/ __ \ / \_/ __ \ \/ / \ ___ / /_\ \| | | __)_
Y \ ___/| Y Y \ ___/ > <\ \_\ \/ | \ | | v0.1 \
_|__ /\___ >__|_| /\___ >__/\_ \______ /\____|__ /____| /_______ /
\/ \/ \/ \/ \/ \/ \/ \/
Server side framework for large scale General Architecture Text Engineering tasks.
Usage: run COMMAND
where COMMAND is one of:
Warc load documents from WARC
Nutch load documents from Nutch segment(s)
Hadoop load documents from Hadoop Sequence files
importer generate a SequenceFile containing BehemothDocuments given a directory of raw docs
reader read and inspect document corpus
exporter read and execute intermediate document extraction creating new corpus
filter filter documents and create new corpus
gate process documents using MemexGATE apps
tika parse documents using Tika
uima process documents using UIMA
mahout generate vectors for clustering with Mahout
solr send documents to Solr for indexing
elastic send documents to ElasticSearch for indexing
language-id identify the language of documents
Most commands print help when invoked w/o parameters.
Very little installation is required to run MemexGATE beyond provisioning your Hadoop node/cluster and installing Behemoth as stated in the prerequisites above. MemexGATE is a first-class citizen within the Behemoth framework, meaning that the Behemoth "Processing with GATE" instructions can be followed to the letter.
The procedure is as follows:
hadoop fs -copyFromLocal /mylocalpath/legisgate.zip /apps/legisgate.zip
<property>
  <name>gate.annotationset.input</name>
  <value></value>
  <description>Map the information at the behemoth format onto the select annotationset
  </description>
</property>
<property>
  <name>gate.annotationset.output</name>
  <value></value>
  <description>AnnotationSet to consider when serializing to the behemoth format
  </description>
</property>
<property>
  <name>gate.annotations.filter</name>
  <value>Token</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas
  </description>
</property>
<property>
  <name>gate.features.filter</name>
  <value>Token.string</value>
  <description>if specified, only the feature listed for a type will be kept
  </description>
</property>
<property>
  <name>gate.emptyannotationset</name>
  <value>false</value>
  <description>if specified all the annotations in the Behemoth document will be deleted before
  processing with GATE </description>
</property>
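As a quick sanity check before launching a job, the GATE properties can be read back and inspected. The sketch below is illustrative only: it assumes the properties sit inside a Hadoop-style `<configuration>` root element, and the helper names `read_properties` and `annotation_types` are hypothetical, not part of Behemoth or MemexGATE.

```python
import xml.etree.ElementTree as ET

# Assumption: the properties are wrapped in a Hadoop-style <configuration> root.
SAMPLE = """<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token</value>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>Token.string</value>
  </property>
  <property>
    <name>gate.emptyannotationset</name>
    <value>false</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style property XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        # A missing or empty <value> element is treated as an empty string.
        value = prop.findtext("value") or ""
        if name:
            props[name.strip()] = value.strip()
    return props


def annotation_types(props):
    """Split the comma-separated gate.annotations.filter value into a list."""
    raw = props.get("gate.annotations.filter", "")
    return [t.strip() for t in raw.split(",") if t.strip()]


props = read_properties(SAMPLE)
print(props)
print(annotation_types(props))
```

Reading the values into a plain dictionary keeps the check independent of Hadoop itself, so it can run on a workstation before the configuration is deployed.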
Run MemexGATE on your Behemoth document corpus as follows:
hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  "input path" "target output path" /apps/legisgate.zip
For example:
hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver \
  /data/behemothcorpus /data/behemoth_legisgate_corpus /apps/legisgate.zip
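When the job is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of MemexGATE: the helpers `resolve_job_jar` and `build_gate_command` are hypothetical, and the input path `/data/behemothcorpus` is a placeholder. Because no shell is involved when a command is passed as a list, the jar wildcard has to be expanded explicitly.

```python
import glob
import shlex

GATE_DRIVER = "com.digitalpebble.behemoth.gate.GATEDriver"


def resolve_job_jar(pattern="gate/target/behemoth-gate*job.jar"):
    """Expand the jar wildcard locally; return None unless exactly one file matches."""
    matches = sorted(glob.glob(pattern))
    return matches[0] if len(matches) == 1 else None


def build_gate_command(job_jar, input_path, output_path, gate_app_zip):
    """Return the argument list for one GATEDriver run."""
    return ["hadoop", "jar", job_jar, GATE_DRIVER,
            input_path, output_path, gate_app_zip]


# Fall back to the unexpanded pattern so the printed command is still readable
# on a machine where the Behemoth GATE module has not been built.
jar = resolve_job_jar() or "gate/target/behemoth-gate*job.jar"
cmd = build_gate_command(
    jar,
    "/data/behemothcorpus",             # placeholder input corpus on HDFS
    "/data/behemoth_legisgate_corpus",  # output corpus on HDFS
    "/apps/legisgate.zip",              # zipped GATE application on HDFS
)
print(shlex.join(cmd))
```

The list returned by `build_gate_command` can be handed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` client is on the `PATH`; the sketch only prints the command so it can be reviewed first.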
If you've followed the Behemoth installation instructions and successfully run legisgate from within Behemoth, you are ready to explore other Behemoth modules. For example, a next step might be to use the Behemoth Solr module to persist the data into an indexing engine such as Apache Solr or Elasticsearch.
Many thanks go to Julien Nioche of DigitalPebble Ltd., who developed and maintains the Behemoth software. Thank you, Julien, for licensing your code under ALv2.0. This work is funded through the DARPA Memex project.
Lewis John McGibbney, lewis.j.mcgibbney@jpl.nasa.gov
MemexGATE is licensed permissively under the Apache License v2.0.