CenterForOpenScience/sharepa

Name: sharepa

Owner: Center for Open Science

Description: A python client for browsing and analyzing SHARE data (https://osf.io/share)

Created: 2015-06-04 13:41:00.0

Updated: 2016-12-05 00:35:17.0

Pushed: 2017-03-24 20:08:59.0

Homepage:

Size: 203

Language: Python

README

sharepa

A python client for browsing and analyzing SHARE data (http://share-research.readthedocs.io/en/latest/), gathered with the SHARE Processing Pipeline (https://github.com/CenterForOpenScience/SHARE). It builds heavily (almost completely) on the elasticsearch-dsl package for handling Elasticsearch querying and aggregations, and contains some additional utilities to help with graphing and analyzing the data.

Use Binder to run some SHARE data tutorials online.

Installation

You can install sharepa using pip (inside a virtualenv):

pip install git+https://github.com/CenterForOpenScience/sharepa@develop

Note: The version above works with SHARE v2's elasticsearch API. To install the version that works with v1 of the SHARE API, run `pip install sharepa`.

Getting Started

Here are some basic searches to get started parsing through SHARE data.

Basic Search

A basic search will provide access to all documents in SHARE, in slices of 10 documents.
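For context, executing the basic search without any slicing returns just the first 10 hits (a quick sketch; counting and slicing are covered in the sections below):

from sharepa import basic_search

results = basic_search.execute()
print(len(results.hits))  # 10 hits come back per slice by default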

Count

You can use sharepa and the basic search to get the total number of documents in SHARE

from sharepa import basic_search

print(basic_search.count())
Iterating through results

Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results

results = basic_search.execute()

for hit in results:
    print(hit.title)

If we don't want 10 results, or we want to offset the results, we can use slices

results = basic_search[5:10].execute()
for hit in results:
    print(hit.title)
Advanced Search

You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using lucene query syntax.

from sharepa import ShareSearch

my_search = ShareSearch()

my_search = my_search.query(
    'query_string',  # Type of query; will accept a lucene query string
    query='NOT tags:*',  # This lucene query string will find all documents that don't have tags
    analyze_wildcard=True  # This makes elasticsearch pay attention to the asterisk (which matches anything)
)
This type of query accepts a lucene query string. Other options include an exists query, a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.
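For instance, a bool query combining several clauses might look like the sketch below (a hypothetical example; the rest of this walkthrough sticks with the my_search query defined above):

from elasticsearch_dsl import Q
from sharepa import ShareSearch

bool_search = ShareSearch().query(
    Q('bool',
      must=[Q('exists', field='title')],            # documents must have a title
      must_not=[Q('query_string', query='tags:*')]  # and must not have any tags
    )
)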

We can see the query that we're about to send to elasticsearch by using the pretty_print helper function:

from sharepa.helpers import pretty_print

pretty_print(my_search.to_dict())


"query": {
    "exists": {
        "field": "tags"
    }
}

When you execute that query, you can then iterate through the results the same way that you could with the simple search query.

new_results = my_search.execute()
for hit in new_results:
    print(hit.title)
Aggregations for data analysis

While searching for individual results is useful, sharepa also lets you make aggregation queries that give you results across the entirety of the SHARE dataset at once. This is useful if you're curious about the completeness of data sets. For example, we can find the number of documents per source that are missing titles.

We can add an aggregation to my_search that will give us the number of documents per source that meet the previously defined search query (in our case, items that don't have tags). Here's what adding that aggregation looks like:

my_search.aggs.bucket(
    'sources',  # Every aggregation needs a name
    'terms',  # There are many kinds of aggregations; terms is a pretty useful one
    field='sources',  # We store the source of a document in the sources field
    size=0,  # These make sure we get numbers for all the sources, to make it easier to combine graphs
    min_doc_count=0
)

We can see which query is actually going to be sent to elasticsearch by printing out the query.

pretty_print(my_search.to_dict())


"query": {
    "exists": {
        "field": "tags"
    }
},
"aggs": {
    "sources": {
        "terms": {
            "field": "_type", 
            "min_doc_count": 0, 
            "size": 0
        }
    }
}

This is the actual query that will be sent to the SHARE API. You can see that it added a section called “aggs” to the basic query that we made earlier.

You can access the aggregation data for basic plotting and analysis through the aggregation buckets on the executed results.
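For example, iterating over the buckets of the 'sources' aggregation defined above might look like this (a short sketch; each bucket exposes a key and a doc_count):

new_results = my_search.execute()

for bucket in new_results.aggregations.sources.buckets:
    print(bucket.key, bucket.doc_count)  # source name and number of matching documents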

Basic Plotting

Sharepa has some basic functions to get you started making plots using matplotlib and pandas.

Raw sharepa data is in the same format as elasticsearch results, represented as a nested structure. To convert the data into a format that pandas can recognize, we have to convert it into a dataframe.

Creating a dataframe from sharepa data

We can use the bucket_to_dataframe function to convert the elasticsearch formatted data into a pandas dataframe. To do this, we pass the title of the new column we want created, and the place to find the nested aggregation data.

Let's re-execute my_search, which now includes the aggregation, and store the result in the new_results variable.

new_results = my_search.execute()

To convert these results to a pandas dataframe, we'll look within the appropriate aggregation bucket of the results, in this case `new_results.aggregations.sources.buckets`.

from sharepa import bucket_to_dataframe
from matplotlib import pyplot

my_data_frame = bucket_to_dataframe('# documents by source - No Tags', new_results.aggregations.sources.buckets)
my_data_frame.plot(kind='bar')
pyplot.show()

This will create a bar graph showing all of the sources, and document counts for each source matching our query of items that do not have tags.
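If you'd rather write the chart to a file than display it interactively, standard matplotlib calls work here too (a sketch; the filename is arbitrary):

my_data_frame.plot(kind='bar')
pyplot.savefig('documents_by_source.png', bbox_inches='tight')  # save the chart to disk instead of showing it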

You can also sort the data based on a certain column, in this case, '# documents by source - No Tags'

my_data_frame.sort(ascending=False, columns='# documents by source - No Tags').plot(kind='bar')
pyplot.show()
Advanced Aggregations

Let's make a more interesting aggregation. Let's look at the documents that are missing titles, by source.

from elasticsearch_dsl import F, Q

my_search.aggs.bucket(
    'missingTitle',  # Name of the aggregation
    'filters',  # We'll want to filter for all the documents that don't have titles
    filters={
        'missingTitle': F(  # F defines a filter
            'fquery',  # This is a query filter, which takes a query and filters documents by it
            query=Q(  # Q can define a query
                'query_string',  # The type of query
                query='NOT title:*',  # This will match all documents that don't have content in the title field
                analyze_wildcard=True,
            )
        )
    }
).metric(  # but wait, that's not enough! We need to break it down by source as well
    'sourceAgg',
    'terms',
    field='sources',
    size=0,
    min_doc_count=0
)

We can check out what the query looks like now:

pretty_print(my_search.to_dict())


"query": {
    "query_string": {
        "analyze_wildcard": true, 
        "query": "NOT tags:*"
    }
}, 
"aggs": {
    "sources": {
        "terms": {
            "field": "sources", 
            "min_doc_count": 0, 
            "size": 0
        }
    }, 
    "missingTitle": {
        "aggs": {
            "sourceAgg": {
                "terms": {
                    "field": "sources", 
                    "min_doc_count": 0, 
                    "size": 0
                }
            }
        }, 
        "filters": {
            "filters": {
                "missingTitle": {
                    "fquery": {
                        "query": {
                            "query_string": {
                                "query": "NOT title:*", 
                                "analyze_wildcard": true
                            }
                        }
                    }
                }
            }
        }
    }
}

Wow this query has gotten big! Good thing we don't have to define it by hand.

Now we just need to execute the search:

my_results = my_search.execute()

Let's check out the results, and make sure that there are indeed no tags.

for hit in my_results:
    print(hit.title, hit.get('tags'))  # we can see there are no tags in our results

Let's pull out those buckets and turn them into dataframes for more analysis

missing_title = bucket_to_dataframe('missingTitle', my_results.aggregations.missingTitle.buckets.missingTitle.sourceAgg.buckets)
matches = bucket_to_dataframe('matches', my_results.aggregations.sources.buckets)

It'd be great if we could merge this dataframe with another that has information about all of the documents. Luckily, we have a built-in function, source_counts, that will give us that dataframe easily.

We can use that dataframe and merge it with our newly created one:

from sharepa.helpers import source_counts
from sharepa.analysis import merge_dataframes

merged = merge_dataframes(source_counts(), matches, missing_title)

We can also easily do computations on these columns and add the results to the dataframe. Here's a way to get a pandas dataframe with a column giving the percentage of each source's documents that are missing both tags and a title:

merged['percent_missing_tags_and_title'] = (merged.missingTitle / merged.total_source_counts) * 100
merged['percent_missing_tags_and_title'].plot(kind='bar')  # plot the new column so the show() below has something to display
pyplot.show()
Examples

The following examples cover some of the more common use cases of sharepa. They are by no means exhaustive; for more information, see the elasticsearch and elasticsearch-dsl documentation.

Query examples

Queries and filters are very similar and have many overlapping search types (e.g. filter by range vs. query by range). Queries sort returned hits by relevance (using the _score field); filters ignore relevance and simply find documents that match the given search criteria.

From the Elasticsearch docs:

As a general rule, queries should be used instead of filters:
- for full text search
- where the result depends on a relevance score
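For instance, because queries are scored, you can inspect each hit's relevance through its metadata (a sketch with a hypothetical match query on the title field):

from sharepa import ShareSearch

scored_search = ShareSearch().query('match', title='climate')
for hit in scored_search.execute():
    print(hit.meta.score, hit.title)  # hits come back ordered by relevance score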

Ex: Let's get all the documents with titles containing the word 'cell', using a regex:

my_search = ShareSearch()  # create search object
my_search = my_search.query(
    "regexp",  # the first arg in a query or filter is the type of filter/query to be employed
    title='.*cell.*'  # then come the arguments; these differ by query type, but generally: name_of_the_field_to_be_operated_on='argument_values'
)

Ex: Or we can get all documents from MIT:

my_search = ShareSearch()  # create search object
my_search = my_search.query(
    "match",  # the first arg in a query or filter is the type of filter/query to be employed
    source='mit'  # then come the arguments; these differ by query type, but generally: name_of_the_field_to_be_operated_on='argument_values'
)

For more information on query types, see the elasticsearch docs

Filters

From the Elasticsearch docs:

As a general rule, filters should be used instead of queries:
- for binary yes/no searches
- for queries on exact values

For more filter types see: Elasticsearch Filter Docs

Ex: Applying a filter to a search. Here, results will only contain hits with providerUpdatedDateTime between 2014-01-01 and 2015-01-01:

my_search = ShareSearch()  # create search object
my_search = my_search.filter(  # apply filter to search
    "range",  # apply a range type filter
    providerUpdatedDateTime={  # the field in the data we compare
        'gte': '2014-01-01',  # hits must be greater than or equal to this date and...
        'lte': '2015-01-01'  # hits must be less than or equal to this date
    }
)

Ex: We can add a second filter to the first; now hits will match both filters (date range and tags that start with 'ba'). Note: there are many ways to write filters/queries, depending on the level of abstraction you want from elasticsearch.

# here is a pure elasticsearch-dsl filter
my_search = my_search.filter(
    "prefix",
    tags="ba"
)

# here is the same search as a mix of elasticsearch-dsl and elasticsearch, where the args are input as a dictionary a la elasticsearch
my_search = my_search.filter(
    "prefix",
    **{"tags": "ba"}
)

# we can also match elasticsearch syntax exactly, and input the raw dictionary into the filter method
my_search = my_search.filter(
    {
        "prefix": {"tags": "ba"}
    }
)
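As before, you can confirm how these filters are combined by printing the generated request body (pretty_print was imported earlier from sharepa.helpers):

pretty_print(my_search.to_dict())  # both the range and prefix filters appear in the request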


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.