datamade/Levenshtein_search

Name: Levenshtein_search

Owner: datamade

Description: Python search module for fast approximate string matching

Forked from: mattandahalfew/Levenshtein_search

Created: 2016-11-08 15:56:27.0

Updated: 2017-07-14 23:56:31.0

Pushed: 2017-04-24 01:27:48.0

Homepage:

Size: 54

Language: C

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

Levenshtein_search

By Matt Anderson. 2016-2017

![Linux build](https://img.shields.io/travis/mattandahalfew/Levenshtein_search.svg?style=flat-square&label=Linux build)![Mac OS X build](https://img.shields.io/travis/mattandahalfew/Levenshtein_search.svg?style=flat-square&label=Mac OS X build)![Windows build](https://img.shields.io/appveyor/ci/mattandahalfew/levenshtein-search/master.svg?style=flat-square&label=Windows build)

Levenshtein_search is a Python module that stores any number of documents as ternary search trees. It performs fuzzy searches for words in a document that are d distance away from a query word. Searches can also be used in conjunction with TF-IDF calculations. The term frequency (TF) is computed for each approximately matching word in the document, as well as the Levenshtein distance from your query word. The module was written in C and increases search speed by using graph search algorithms and minimizing the number of redundant comparisons.

Usage
rt Levenshtein_search

rpt1 = ["We","went","to","the","fire","Mother","said","Is","he","cold","Versh","Nome","Versh","said","Take","his","overcoat","and","overshoes","off","Mother","said","How","many","times","do","I","have","to","tell","you","not","to","bring","him","into","the","house","with","his","overshoes","on"]
rpt2 = ["Yessum","Versh","said","Hold","still","now","He","took","my","overshoes","off","and","unbuttoned","my","coat","Caddy","said","Wait","Versh","Cant","he","go","out","again","Mother","I","want","him","to","go","with","me","Youd","better","leave","him","here","Uncle","Maury","said","Hes","been","out","enough","today"]

t_wordset = Levenshtein_search.populate_wordset(-1,excerpt1)
rst_wordset = 0
_wordset = Levenshtein_search.populate_wordset(-1,excerpt2)
st_wordset = 1

The module accepts documents as Python lists of strings. To create a new document and give it a set of words, use the populate_wordset(x,excerpt1) function where x is an integer representing the document's index. If you would like the new document's index to be assigned, x should equal -1 and the function will return the new document's index, starting with 0. If you would like to add words to a preexisting document, x should equal that document's index. In the example above, excerpt1 is the Python list of strings.

"overcoat"
ist = 4
lts1 = Levenshtein_search.lookup(first_wordset,q,maxdist)
lts2 = Levenshtein_search.lookup(last_wordset,q,maxdist)

To search a document for your query word, use the lookup(x,q,maxdist) function where x is a non-negative integer representing the document's index, q is a string representing your query word, and maxdist is a non-negative integer representing the maximum allowable Levenshtein distance from your query word.

= Levenshtein_search.add_string(first_wordset,"coat")
dx = Levenshtein_search.remove_string(first_wordset,"coat")
dx==rm_idx) is true
nshtein_search.remove_string(first_wordset,q)

lts3 = Levenshtein_search.lookup(first_wordset,q,maxdist)

You can add or remove words from the document on an individual basis, using add_string and remove_string functions. The first argument is the document's index and the second is the word to add or remove.

nshtein_search.clear_wordset(last_wordset)

To clear a document from memory, use the clear_wordset(x) function where x is a non-negative integer representing the document's index. After clearing a document, x, documents retain their previous index (this is different from v1.3).

Output
t("Query word: %s" % q)
i in range(0,len(results1)):
print("%s" % results1[i][0]+": "+ str(results1[i][1])+", "+str(results1[i][2]))

t("Query word: %s" % q)
i in range(0,len(results2)):
print("%s" % results2[i][0]+": "+ str(results2[i][1])+", "+str(results2[i][2]))

t("Query word: %s" % q)
i in range(0,len(results3)):
print("%s" % results3[i][0]+": "+ str(results3[i][1])+", "+str(results3[i][2]))

The output of the lookup() function is a list of lists equal in length to the number of unique words returned by the search. For each unique word, the function retrieves the Levenshtein distance from your query word, and the term frequency of that word in the document.

y word: overcoat
coat: 0, 0.0238095238095
shoes: 4, 0.047619047619
y word: overcoat
: 4, 0.0222222222222
shoes: 4, 0.0222222222222
y word: overcoat
shoes: 4, 0.0487804878049

If there is any word in the results that is the same as your query word, it is guaranteed to be the first item of the result list. All other words are in no particular order.

Installation

pip install Levenshtein-search

Preliminary benchmarking

In a timed test of this module with Levenshtein edit distance of 2, it was approx. 10x faster than using an equivalent search in PostgreSQL.

ct name from restaurant_nophone_training where levenshtein_less_equal(name, '"philippe the original"', 2) <= 2;
198695853199 sec
philippe\'s the original"',), ('"philippe the original"',)]

nshtein_search algorithm:
019705373871 sec
philippe the original"', 0, 0.0011574074074074073], ['"philippe\'s the original"', 2, 0.0011574074074074073]]
Upcoming

Specialized O(number of nodes) algorithm for large edit distances


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.