berkmancenter/mediacloud-lithuanian-stemmer

Name: mediacloud-lithuanian-stemmer

Owner: Berkman Klein Center for Internet & Society

Description: (Fork of) Snowball version of Porter stemmer for Lithuanian language

Forked from: tokenmill/snowball

Created: 2017-12-27 13:41:04.0

Updated: 2017-12-27 13:44:28.0

Pushed: 2015-07-13 20:13:25.0

Homepage: http://tokenmill.lt/

Size: 160

Language: null

README

snowball

The old version of the Snowball implementation of the Porter stemmer for Lithuanian is in the file lithuanian.sbl.

The new version is in the file conservative.sbl.

The difference between the new and old versions is that the new one is less aggressive. This means that there should be fewer words that are overstemmed.

The new stemmer was created with search applications in mind, so nouns are treated as more important than adjectives, verbs, etc. As a result, some suffixes, such as -ut- in 'kalakutas', are left untouched during stemming. On the other hand, this leaves some adjectives understemmed, e.g. 'sveikutis -> sveikut'. There will always be trade-offs.
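The trade-off above can be illustrated with a toy sketch. It is not the rule set in conservative.sbl or lithuanian.sbl: the class name, the ending list ("as", "is") and the length checks are all assumptions chosen only to reproduce the 'sveikutis -> sveikut' example and to show what stripping -ut- would do to a noun such as 'kalakutas'.

```java
// Toy illustration only. The endings and length checks below are
// assumptions for this example, not the rules in conservative.sbl.
class ToyLithuanianStemmer {
    // Hypothetical inflectional endings removed by both variants.
    private static final String[] ENDINGS = {"as", "is"};

    // Remove one known ending, but only if at least 3 characters remain.
    static String stripEnding(String word) {
        for (String ending : ENDINGS) {
            if (word.endsWith(ending) && word.length() > ending.length() + 2) {
                return word.substring(0, word.length() - ending.length());
            }
        }
        return word;
    }

    // Conservative variant: removes the ending and leaves -ut- alone.
    static String conservative(String word) {
        return stripEnding(word);
    }

    // Aggressive variant: additionally strips a trailing -ut-.
    static String aggressive(String word) {
        String stem = stripEnding(word);
        if (stem.endsWith("ut") && stem.length() > 4) {
            return stem.substring(0, stem.length() - 2);
        }
        return stem;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"kalakutas", "sveikutis"}) {
            System.out.println(w + " -> conservative: " + conservative(w)
                    + ", aggressive: " + aggressive(w));
        }
    }
}
```

In this sketch the conservative variant keeps the noun stem 'kalakut' intact at the cost of leaving 'sveikut' understemmed, while the aggressive variant reduces 'sveikutis' further but also cuts into the noun.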

NOTE:

The current stemmer version uses the length of the string to prevent overstemming. A stemmer generated by the snowball program extends the org.tartarus.snowball.SnowballProgram class and gets the length of the current string with the Java call current.length().

Lucene 4.10.1, however, implements SnowballProgram with the attribute current declared private, so current.length() does not compile against Lucene. The workaround is to replace current.length() with getCurrent().length() on line 589.
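A minimal sketch of the workaround follows. SnowballProgramStandIn is a hypothetical stand-in for Lucene's SnowballProgram, reduced to the one detail that matters here (a private current field with a public getCurrent() accessor). The subclass name, the method name and the MIN_LENGTH threshold are likewise assumptions and are not taken from the generated stemmer.

```java
// Hypothetical stand-in for Lucene's org.tartarus.snowball.SnowballProgram.
// Only the relevant detail is modelled: `current` is private, so subclasses
// must go through getCurrent().
abstract class SnowballProgramStandIn {
    private StringBuilder current = new StringBuilder();

    public void setCurrent(String value) {
        current = new StringBuilder(value);
    }

    public String getCurrent() {
        return current.toString();
    }
}

// Hypothetical generated stemmer containing only the length guard.
class LengthGuardStemmer extends SnowballProgramStandIn {
    // Assumed threshold, for illustration only.
    private static final int MIN_LENGTH = 4;

    boolean longEnoughToStem() {
        // return current.length() >= MIN_LENGTH;   // does not compile: `current` is private
        return getCurrent().length() >= MIN_LENGTH; // compiles via the public accessor
    }

    public static void main(String[] args) {
        LengthGuardStemmer stemmer = new LengthGuardStemmer();
        stemmer.setCurrent("kalakutas");
        System.out.println(stemmer.longEnoughToStem());
    }
}
```

The substitution does not change what the guard checks; it only reads the same string through the accessor instead of the field.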


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.