NaturalHistoryMuseum/ckanext-datasolr

Name: ckanext-datasolr

Owner: Natural History Museum

Description: datasolr is a Ckan extension to use Solr for datastore queries

Created: 2015-03-13 17:47:59.0

Updated: 2017-08-23 09:52:10.0

Pushed: 2018-03-19 16:35:32.0

Homepage: null

Size: 125

Language: Python

README

Ckan Datastore Solr extension

datasolr is a Ckan extension to use Solr to perform datastore queries.

Motivated by low PostgreSQL performance on very large datasets, datasolr provides an alternative API endpoint that performs searches using Solr. datasolr is compatible with the datastore_search API endpoint and can be configured to replace it. The returned results may differ, however (see differences with datastore_search).
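Because datasolr can stand in for datastore_search, existing clients should not need to change how they call the API. The sketch below builds (but does not send) a standard CKAN datastore_search request using Python 3. The CKAN URL, the resource id and the `genus` filter field are placeholders chosen for illustration, not values required by datasolr.

```python
import json
from urllib.request import Request

CKAN_URL = 'http://localhost:5000'  # hypothetical CKAN instance
RESOURCE_ID = '75cc58ff-db88-4ca7-a321-9bb24a89b781'  # placeholder resource id


def build_search_request(ckan_url, resource_id, filters=None, limit=10):
    """Build a POST request for CKAN's datastore_search action."""
    payload = {'resource_id': resource_id, 'limit': limit}
    if filters:
        payload['filters'] = filters
    return Request(
        ckan_url.rstrip('/') + '/api/3/action/datastore_search',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )


# 'genus' is a hypothetical field name in the dataset.
request = build_search_request(CKAN_URL, RESOURCE_ID, filters={'genus': 'Rosa'})
# urllib.request.urlopen(request) would send it to a running CKAN instance.
```

Whether such a call is answered by PostgreSQL or by Solr then depends only on the server configuration, not on the client.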

Use case

datasolr aims to replace the search component of the datastore only. It is not a full replacement for the datastore, and its use case is large datasets that are either never updated or updated only at regular intervals. As such:

Differences with datastore_search

datasolr also provides some extra features:

Usage

Note that you will need a good understanding of Solr to use this extension. Out of the box, datasolr does not provide schemaless or dynamic field mapping into Solr, so you need to write a schema for the specific dataset on which you wish to use datasolr.

Note that datasolr can be extended to provide schemaless and/or dynamic field mapping, and future versions may provide this.

For now the typical usage would be:

  1. Identify datasets that would benefit from faster searches;
  2. Create a schema for that particular dataset (either on a shared core or a dedicated core). The field names should be the same as the database field names. Note that Solr field names only support alphanumeric and underscore characters - if your dataset does not conform to this, you will need to provide a custom field mapper (see configuration);
  3. Index your dataset (see indexing with data import);
  4. Install and configure datasolr.

Configuration

datasolr is configured in the main Ckan configuration file. First add datasolr to your list of plugins (such that it is the first plugin to use the IDataSolr interface), then configure it with the following keys (most of them are optional, as the defaults are sensible):

# Whether to replace datastore_search api calls or not. Set this to False
# (the default) until you're happy datasolr is working.
datasolr.replace_datastore_search = False

# The action to fall back to when a given resource is not handled by datasolr.
# The default is the main ckan datastore_search action. Unless you're using
# another plugin that overrides this (eg. ckanext-dataproxy) you do not need
# to change this.
datasolr.fallback = ckanext.datastore.logic.action.datastore_search


# Below are the parameters used by all queries that do not have a resource
# specific configuration. Typically, unless you implement dynamic field
# mapping, you would only want resource specific configuration.


# The method used to map API field names to Solr field names. The default
# implementation strips characters that are not allowed as Solr field names.
# Use this if you have non alphanumeric characters in your field names. This
# could also be used to provide dynamic field mapping in Solr.
datasolr.field_mapper = ckanext.datasolr.lib.solrqueryapi.default_field_mapper

# The Solr search url, including the core and searcher
datasolr.search_url = http://localhost:8080/solr/collection2/select

# A unique field in the dataset. This defaults to `_id`, which is what
# the datastore uses as primary key. You may use this if your `_id` is not
# consistent across rebuilds, and you'd rather use a different field.
datasolr.id_field = _id

# The field in the Solr schema that matches the key defined in id_field
datasolr.solr_id_field = _id

# The field in the Solr schema that holds the resource id. If this is present,
# then the resource id will be included in all queries. This means you can use
# a single Solr core for multiple datasets. If you dedicate a core to a
# single dataset, then you might as well omit this.
datasolr.resource_id_field = resource_id


# Resource specific configuration contains the same fields, prefixed
# by `resource.<resource id>`.


datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.field_mapper = ckanext.datasolr.lib.solrqueryapi.default_field_mapper
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.search_url = http://localhost:8080/solr/collection2/select
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.id_field = _id
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.solr_id_field = _id
datasolr.resource.75cc58ff-db88-4ca7-a321-9bb24a89b781.resource_id_field = resource_id

Extending datasolr

The field mapper, allowing users to implement different field mapping strategies such as dynamic fields, can be set directly in the configuration (see configuration).
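As a rough illustration, a custom field mapper might look like the sketch below. It assumes the mapper is a plain callable that receives one API field name and returns the corresponding Solr field name; that signature is an assumption and should be checked against default_field_mapper in the source. The name `underscore_field_mapper` is hypothetical. Instead of stripping disallowed characters as the default does, it replaces each run of them with an underscore so that word boundaries survive.

```python
import re

# Anything other than letters, digits and underscores is treated as
# not allowed in a Solr field name.
_INVALID_RUN = re.compile(r'[^A-Za-z0-9_]+')


def underscore_field_mapper(field_name):
    """Hypothetical mapper: replace each run of disallowed characters with '_'.

    Assumes datasolr calls the mapper with a single field name string
    and expects the Solr field name back.
    """
    return _INVALID_RUN.sub('_', field_name)


print(underscore_field_mapper('Catalogue Number (old)'))
print(underscore_field_mapper('_id'))
```

Such a function would be referenced by its dotted path in the field_mapper configuration key. Whatever strategy is used, the names it produces must match the fields declared in the Solr schema, and two different API field names should not map to the same Solr field.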

It is also possible to extend how datasolr builds queries by implementing the IDataSolr interface, which is analogous to the IDatastore interface. The IDataSolr interface is documented in the source code.

Here is an example implementation that adds a custom filter to return all rows which have a value for image_url:

import ckan.plugins as p
from ckanext.datasolr.interfaces import IDataSolr

class MyPlugin(p.SingletonPlugin):
    p.implements(IDataSolr)

    def datasolr_validate(self, context, data_dict, field_types):
        """ Validate the query by removing all filters that we manage """
        if 'filters' in data_dict and '_has_image' in data_dict['filters']:
            del data_dict['filters']['_has_image']
        # Assumption: as with IDatastore.datastore_validate, the remaining
        # data_dict is returned to the caller.
        return data_dict

    def datasolr_search(self, context, data_dict, field_types, query_dict):
        """ Add our custom search terms """
        if 'filters' in data_dict and '_has_image' in data_dict['filters']:
            # Solr range query matching any row that has a value for image_url
            query_dict['q'][0].append('image_url:[* TO *]')
        return query_dict

Indexing with data import

Solr offers a way to index data directly from a PostgreSQL database using the Data Import Request Handler module.
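Once a data import handler has been configured for a core, an import is started with a plain HTTP request to that handler. The sketch below only builds the URL. It assumes the handler is registered at `/dataimport` on the same core as the search URL (the path is whatever you declare in solrconfig.xml), and the helper name `dataimport_url` is hypothetical.

```python
from urllib.parse import urlencode


def dataimport_url(search_url, command='full-import', handler='dataimport'):
    """Derive a data import URL from a Solr search URL.

    Assumes the search URL ends with the searcher name (e.g. '/select')
    and that the import handler lives on the same core.
    """
    core_url = search_url.rsplit('/', 1)[0]
    query = urlencode([('command', command), ('wt', 'json')])
    return '{0}/{1}?{2}'.format(core_url, handler, query)


url = dataimport_url('http://localhost:8080/solr/collection2/select')
print(url)
# prints http://localhost:8080/solr/collection2/dataimport?command=full-import&wt=json
```

Passing `command='status'` instead yields a URL that reports on the progress of a running import.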

To index your resources in this way you will need to:


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.