Name: ckanext-datasolr
Owner: Natural History Museum
Description: datasolr is a Ckan extension to use Solr for datastore queries
Created: 2015-03-13 17:47:59.0
Updated: 2017-08-23 09:52:10.0
Pushed: 2018-03-19 16:35:32.0
Homepage: null
Size: 125
Language: Python
datasolr is a CKAN extension that uses Solr to perform datastore queries.
Motivated by poor PostgreSQL performance on very large datasets, datasolr provides an alternative API endpoint that performs searches using Solr. datasolr is compatible with, and can be configured to replace, the datastore_search API endpoint. The returned results may differ, however (see the differences with datastore_search below).
datasolr aims to replace the search component of the datastore only. It is not a full replacement for the datastore, and its use case is for large datasets that are either not updated, or updated at regular intervals only. As such:

- datastore_search allows for PostgreSQL full text query syntax. datasolr does not, and does not attempt to parse the PostgreSQL syntax into Solr queries (with the exception of the field full text search prefix - see below);
- The q parameter passed to datastore_search is typically a full text search string, but it can also be a dictionary of field to values - the idea being to implement full text search on individual fields. datasolr does not implement this as a full text search, but as a wildcard search instead. The optional PostgreSQL full text query syntax prefix component :* is stripped from field full text searches.

datasolr also provides some extra features:

- solr_stats_fields, a request parameter which lists the fields to fetch statistics for. The statistics are added to the field definition object in fields;
- _solr_not_empty, which expects a list of fields, and will ensure the given fields are not empty.

Note that you will need a good understanding of Solr to use this extension. Out of the box, datasolr does not provide schemaless or dynamic field mapping into Solr - as such, you need to write a schema for the specific dataset on which you wish to use datasolr. datasolr can, however, be extended to provide schemaless and/or dynamic field mapping, and future versions may provide this.
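As an illustration, the extra parameters are passed alongside the usual datastore_search parameters. This is a sketch only - the CKAN URL, resource id and field names below are all hypothetical:

```python
import json
import urllib.request

# Hypothetical CKAN instance and resource id - replace with your own.
CKAN_URL = 'http://localhost/api/3/action/datastore_search'
RESOURCE_ID = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'

# Standard datastore_search parameters plus the datasolr extras:
# - solr_stats_fields requests statistics for the listed fields;
# - _solr_not_empty keeps only rows where the listed fields have a value.
payload = {
    'resource_id': RESOURCE_ID,
    'q': 'insecta',
    'limit': 10,
    'solr_stats_fields': ['collection_date'],
    '_solr_not_empty': ['image_url'],
}

request = urllib.request.Request(
    CKAN_URL,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
# urllib.request.urlopen(request)  # uncomment to run against a live instance
```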
For now, the typical usage is as follows.
datasolr is configured in the main CKAN configuration file. First add datasolr to your list of plugins (such that it is the first plugin to use the IDataSolr interface); you can then configure it with the following keys (note that most of them are not necessary - the defaults are sensible):
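For instance, the plugin list might look like this (the other plugin names are purely illustrative - keep whatever plugins your instance already uses, with datasolr listed before them):

```ini
ckan.plugins = datasolr datastore text_view
```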
# Whether to replace datastore_search api calls or not. Set this to False
# (the default) until you're happy datasolr is working.
datasolr.replace_datastore_search = False

# The action to fall back to when a given resource is not handled by datasolr.
# The default is the main ckan datastore_search action. Unless you're using
# another plugin that overrides this (eg. ckanext-dataproxy) you do not need
# to change this.
datasolr.fallback = ckanext.datastore.logic.action.datastore_search

# Below are the parameters used by all queries that do not have a resource
# specific configuration. Typically, unless you implement dynamic field
# mapping, you would only want resource specific configuration.

# The method used to map API field names to Solr field names. The default
# implementation strips characters that are not allowed in Solr field names.
# Use this if you have non alphanumeric characters in your field names. This
# could also be used to provide dynamic field mapping in Solr.
datasolr.field_mapper = ckanext.datasolr.lib.solrqueryapi.default_field_mapper

# The Solr search url, including the core and searcher.
datasolr.search_url = http://localhost:8080/solr/collection2/select

# A unique field in the dataset. This defaults to `_id`, which is what
# the datastore uses as primary key. You may use this if your `_id` is not
# consistent across rebuilds, and you'd rather use a different field.
datasolr.id_field = _id

# The field in the Solr schema that matches the key defined in id_field.
datasolr.solr_id_field = _id

# The field in the Solr schema that holds the resource id. If this is present,
# then the resource id will be included in all queries. This means you can use
# a single Solr core for multiple datasets. If you dedicate a core to a
# single dataset, then you might as well omit this.
datasolr.resource_id_field = resource_id

# Resource specific configurations contain the same fields, prefixed
# by `resource.<resource id>`.
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.field_mapper = ckanext.datasolr.lib.solrqueryapi.default_field_mapper
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.search_url = http://localhost:8080/solr/collection2/select
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.id_field = _id
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.solr_id_field = _id
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.resource_id_field = resource_id
The field mapper, allowing users to implement different field mapping strategies such as dynamic fields, can be set directly in the configuration (see configuration).
It is also possible to extend how datasolr builds queries by implementing the IDataSolr interface, which is analogous to the IDatastore interface. The IDataSolr interface is documented in the source code.
Here is an example implementation that adds a custom filter to return all rows which have a value for image_url
:
import ckan.plugins as p
from ckanext.datasolr.interfaces import IDataSolr

class MyPlugin(p.SingletonPlugin):
    p.implements(IDataSolr)

    def datasolr_validate(self, context, data_dict, field_types):
        """ Validate the query by removing all filters that we manage """
        if 'filters' in data_dict and '_has_image' in data_dict['filters']:
            del data_dict['filters']['_has_image']

    def datasolr_search(self, context, data_dict, field_types, query_dict):
        """ Add our custom search terms """
        if 'filters' in data_dict and '_has_image' in data_dict['filters']:
            query_dict['q'][0].append('image_url:[* TO *]')
        return query_dict
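With such a plugin enabled, the custom filter is passed like any other datastore_search filter. The request body below is a sketch (the resource id and the regular field filter are illustrative; _has_image is the custom filter handled by the plugin above):

```python
import json

# datastore_search request body using the custom '_has_image' filter
# alongside a regular field filter; values here are illustrative.
payload = {
    'resource_id': 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee',
    'filters': {
        '_has_image': True,          # handled by MyPlugin above
        'collection': 'entomology',  # a regular field filter
    },
    'limit': 100,
}

body = json.dumps(payload)
```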
Solr offers a way to index data directly from a PostgreSQL database using the Data Import Request Handler module. To index your resources in this way you will need to:

- Add the PostgreSQL JDBC driver to your Solr installation;
- Add the Solr dataimport handler jar to your Solr installation;
- Add a data import handler section to your solrconfig.xml, for instance:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>
- Add a data-config.xml file describing how to import the resource to index, for instance:
<dataConfig>
    <dataSource driver="org.postgresql.Driver"
                url="jdbc:postgresql://my_postgres_server:5432/datastore_default"
                user="datastore_default"
                password="my_secret_password" />
    <document name="my_dataset">
        <entity name="my_entity"
                query="SELECT &quot;_id&quot;, &quot;myfield&quot;, &quot;my_other_field&quot;
                       FROM &quot;my_resource_id&quot; ORDER BY _id ASC">
        </entity>
    </document>
</dataConfig>
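Once Solr is set up this way, imports are triggered through the Data Import Handler's command URLs. A sketch, assuming the same illustrative host and core as the search_url examples above:

```python
# Base URL of the (illustrative) Solr core configured earlier.
SOLR_BASE = 'http://localhost:8080/solr/collection2'

# The Data Import Handler is driven by GET commands on its endpoint:
# 'full-import' rebuilds the index from the database, while 'status'
# reports on the progress of a running import.
full_import_url = SOLR_BASE + '/dataimport?command=full-import'
status_url = SOLR_BASE + '/dataimport?command=status'

# import urllib.request
# urllib.request.urlopen(full_import_url)  # run against a live Solr
```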