Name: es-amazon-s3-river
Owner: NYT Newsroom Developers
Description: Amazon S3 river for Elasticsearch
Created: 2015-04-23 21:23:01.0
Updated: 2015-04-24 18:35:36.0
Pushed: 2015-04-24 18:35:36.0
Homepage: null
Size: 1284
Language: Java
Amazon S3 river for Elasticsearch
This river plugin helps to index documents from Amazon S3 buckets.
WARNING: For the 0.0.1 release, you need the Attachment Plugin.
WARNING: Starting from 0.0.2, you no longer need the Attachment Plugin, as we now use Tika directly; see issue #2.
Amazon S3 River Plugin | Elasticsearch | Attachment Plugin | Tika |
---|---|---|---|
master (1.3.1-SNAPSHOT) | 1.3.x | No longer used | 1.4 |
1.3.0 | 1.3.x | No longer used | 1.4 |
1.2.0 | 1.2.x | No longer used | 1.4 |
0.0.4 | 1.0.x and 1.1.x | No longer used | 1.4 |
0.0.3 | 1.0.0 | No longer used | 1.4 |
0.0.2 | 0.90.0 | No longer used | 1.4 |
0.0.1 | 0.90.0 | 1.7.0 | |
Just install it as a regular Elasticsearch plugin by typing:
bin/plugin --install com.github.lbroudoux.elasticsearch/amazon-s3-river/1.2.0
This will do the job…
-> Installing com.github.lbroudoux.elasticsearch/amazon-s3-river/1.2.0...
Trying http://download.elasticsearch.org/com.github.lbroudoux.elasticsearch/amazon-s3-river/amazon-s3-river-1.2.0.zip...
Trying http://search.maven.org/remotecontent?filepath=com/github/lbroudoux/elasticsearch/amazon-s3-river/1.2.0/amazon-s3-river-1.2.0.zip...
Downloading ......DONE
Installed amazon-s3-river
Because Jeb Bush's email server exports RFC822 non-compliant emails, we sadly needed to modify Tika to detect them properly. To build this project, run the following three commands:
mvn install:install-file -Dfile=jar/tika-core-1.9-SNAPSHOT.jar -DgroupId=org.apache.tika -DartifactId=tika-core -Dversion=1.9 -Dpackaging=jar
mvn install:install-file -Dfile=jar/tika-parsers-1.9-SNAPSHOT.jar -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=1.9 -Dpackaging=jar
mvn install
The final command creates your output binary in target/.
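From there, one way to install the locally built plugin is the plugin script's --url flag. The zip name and its location under target/ depend on the version you built, so the path below is only illustrative:

bin/plugin --url file:///path/to/es-amazon-s3-river/target/releases/amazon-s3-river-1.3.1-SNAPSHOT.zip --install amazon-s3-river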
First, you need to log in to the Amazon AWS account that owns the S3 bucket and retrieve your security credentials from the AWS security credentials page. Once done, note your accessKey and secretKey.
We first create an index to store our documents (optional):
curl -XPUT 'http://localhost:9200/mys3docs/' -d '{}'
We create the river with the following properties:
- Bucket: myownbucket
- Path prefix: Work/ (This is optional. If specified, it should be an existing path with the trailing /)
- Include only *.doc and *.pdf files
- Exclude *.zip and *.gz files
curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "includes": "*.doc,*.pdf",
    "excludes": "*.zip,*.gz"
  }
}'
By default, the river uses an index that has the same name (mys3docs in the above example).
From 0.0.2 version
The source_url of documents is now stored within the Elasticsearch index so that you can later access the whole document content from your application (this is indeed a use case coming from Scrutmydocs).
By default, the plugin uses what is called the resourceUrl of an S3 bucket document. If the document has been made public within S3, it can be accessed directly from your browser. If not, the stored URL is intended to be used by a regular S3 client that has the set of credentials required to access the document.
Another option to easily distribute S3 content is to set up a Web proxy in front of S3, such as CloudFront (see Serving Private Content with CloudFront). In that latter case, you'll want to rewrite source_url by substituting the S3 part with your own host name. This plugin allows you to do that by specifying a download_host as a river property.
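For instance, here is a minimal sketch of a river configuration with a download_host. The host name docs.example.com is just a placeholder for your own proxy or CloudFront distribution, and the exact value format expected by the plugin (host only versus full URL) should be checked against your setup:

curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "download_host": "docs.example.com"
  }
}'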
Index options can also be specified when creating an amazon-s3-river: the index name, the document type, and the bulk size used for indexing. You'll use them as follows when creating a river:
curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Work/",
    "update_rate": 900000,
    "includes": "*.doc,*.pdf",
    "excludes": "*.zip,*.gz"
  },
  "index": {
    "index": "amazondocs",
    "type": "doc",
    "bulk_size": 50
  }
}'
From 0.0.4 version
If you want to index JSON files directly without parsing them through Tika, you can set the json_support configuration option to true like this:
curl -XPUT 'http://localhost:9200/_river/mys3docs/_meta' -d '{
  "type": "amazon-s3",
  "amazon-s3": {
    "accessKey": "AAAAAAAAAAAAAAAA",
    "secretKey": "BBBBBBBBBBBBBBBB",
    "name": "My Amazon S3 feed",
    "bucket": "myownbucket",
    "pathPrefix": "Jsons/",
    "update_rate": 900000,
    "json_support": true,
    "includes": "*.json"
  }
}'
Be sure to correctly use includes or excludes in your river configuration so that only JSON documents are retrieved.
When json_support is enabled and you did not define a mapping prior to creating the river, the river will not automatically generate a mapping like the one shown in the Advanced section below. In that case, Elasticsearch will guess the mapping itself.
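If you prefer to control the mapping yourself, you can define it when creating the index, before creating the river. This is a minimal sketch using the standard index-creation API; the field names here are purely hypothetical and only illustrate the idea:

curl -XPUT 'http://localhost:9200/mys3docs/' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "title":  { "type": "string" },
        "author": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'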
If you need to stop a river, you can call the _s3 endpoint with your river name followed by the _stop command, like this:
_s3/mys3docs/_stop
To restart the river from the previous point, just call the corresponding _start endpoint:
_s3/mys3docs/_start
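For example, against a local node (assuming these management endpoints accept plain GET requests):

curl -XGET 'http://localhost:9200/_s3/mys3docs/_stop'
curl -XGET 'http://localhost:9200/_s3/mys3docs/_start'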
When the river detects a new type, it automatically creates a mapping for that type.
oc" : {
"properties" : {
"title" : {
"type" : "string",
"analyzer" : "keyword"
},
"modifiedDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"file" : {
"type" : "attachment",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : "yes"
}
}
}
}
From 0.0.2 version
We now use Tika directly instead of the mapper-attachment plugin.
oc" : {
"properties" : {
"title" : {
"type" : "string",
"analyzer" : "keyword"
},
"modifiedDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"source_url" : {
"type" : "string"
},
"file" : {
"properties" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : "yes"
}
}
}
}
Normally, the river will try to keep the Elasticsearch repository in sync with what is on S3. This means that if a record is deleted on S3, it is removed from Elasticsearch, and vice versa. This can be very memory- and time-intensive for large collections of documents, since it means that all keys must first be retrieved from the bucket into a large array. This array is then used to see if any keys are no longer in Elasticsearch and should be deleted from S3. For large collections of millions of files, you will see Java run out of heap memory when the river attempts to index. I've made a few changes to make it more memory-efficient:
"deleteS3": false
in the _meta
configuration for the river. This will disable the synchronization for deletes between ElasticSearch and S3. In most cases, you might want to disable the synchronization even for small collections and use a read-only key for accessing the S3 collections if you never want source documents to be deleted. software is licensed under the Apache 2 license, quoted below.
This software is licensed under the Apache 2 license, quoted below.

Copyright 2013 Laurent Broudoux

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.