twitter/twitter-korean-text

Name: twitter-korean-text

Owner: Twitter, Inc.

Description: Korean tokenizer

Created: 2014-10-29 21:16:33.0

Updated: 2018-01-10 09:19:27.0

Pushed: 2017-06-26 07:12:52.0

Homepage: null

Size: 28931

Language: Scala

README

twitter-korean-text

Open-source Korean text processor made by Twitter

A Scala/Java library for processing Korean text, with a Java wrapper. twitter-korean-text currently provides Korean normalization and tokenization. Please join our community at Google Forum. This text processor is not limited to short tweets.

The goal of twitter-korean-text is to extract index terms from big data and similar sources through simple Korean processing. It does not aim to be a full-fledged morphological analyzer.

twitter-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

Tokenization

Stemming (입니다 -> 이다)

Phrase extraction

Introductory Presentation: Google Slides

Try it here

Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/

Opensourced here: twitter-korean-tokenizer-api

API

scaladoc

mavendoc

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

<dependency>
  <groupId>com.twitter.penguin</groupId>
  <artifactId>korean-text</artifactId>
  <version>4.4</version>
</dependency>

The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the Scaladocs are at http://twitter.github.io/twitter-korean-text/scaladocs/

Support for other languages

.NET

modamoda kindly offered a .NET wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS

node.js

Ch0p kindly offered a node.js wrapper: twtkrjs

Youngrok Kim kindly offered a node.js wrapper: node-twitter-korean-text

Python

Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py

Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean

Ruby

jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby

Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby

Elasticsearch

socurites's Korean analyzer for Elasticsearch, based on twitter-korean-text: tkt-elasticsearch

Get the source

Clone the Git repo and build using Maven.

git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile

Open 'pom.xml' in your favorite IDE.

Usage

You can find these examples in the examples folder.

from Scala

import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)
    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}


from Java

import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]
  }
}
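
Each token printed above carries a part of speech, an offset, and a length. The following is a minimal sketch, not part of the library, of how those numbers can be read. It assumes the offsets are character indices into the normalized string, and it uses a hypothetical Token class (a stand-in for KoreanToken) filled with the triples copied by hand from the printed output. It recovers each surface form by substring and drops Space tokens, which appears to be what the plain string list in the example does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenOffsetSketch {
    // Hypothetical stand-in for the library's KoreanToken; not the real class.
    static final class Token {
        final String pos;
        final int offset;
        final int length;

        Token(String pos, int offset, int length) {
            this.pos = pos;
            this.offset = offset;
            this.length = length;
        }
    }

    // (pos, offset, length) triples copied by hand from the printed tokenizer output.
    static final List<Token> EXAMPLE_TOKENS = Arrays.asList(
        new Token("Noun", 0, 3), new Token("Josa", 3, 1), new Token("Space", 4, 1),
        new Token("Noun", 5, 2), new Token("Verb", 7, 2), new Token("Space", 9, 1),
        new Token("Noun", 10, 2), new Token("Adjective", 12, 2), new Token("Eomi", 14, 1),
        new Token("KoreanParticle", 15, 2), new Token("Space", 17, 1),
        new Token("Hashtag", 18, 4));

    // Recover a token's surface form, assuming offsets are char indices into text.
    static String surface(String text, Token t) {
        return text.substring(t.offset, t.offset + t.length);
    }

    // Surface forms of all non-Space tokens, in their original order.
    static List<String> surfacesWithoutSpaces(String text, List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.pos.equals("Space")) {
                out.add(surface(text, t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Pass the normalized sentence (22 characters) as the first argument.
        if (args.length == 0) {
            System.err.println("usage: java TokenOffsetSketch.java <normalized sentence>");
            return;
        }
        System.out.println(surfacesWithoutSpaces(args[0], EXAMPLE_TOKENS));
    }
}
```

Run against the normalized sentence from the example, this should list the same nine surface forms as the string list output, provided the assumption about offsets holds.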



Basics

TwitterKoreanProcessor.scala is the central object that provides the interface for all the features.

Running Tests

Run mvn test to execute the unit tests.

Tools

We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.

Contribution

Refer to the general contribution guide. We will add this project-specific contribution guide later.

Detailed guide to installing and modifying the project

Performance

Tested on an Intel i7 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse a chunk: 0.12 ms

Tweets (average length ~50 chars)

Tweets|100K|200K|300K|400K|500K|600K|700K|800K|900K|1M
---|---|---|---|---|---|---|---|---|---|---
Time in Seconds|57.59|112.09|165.05|218.11|270.54|328.52|381.09|439.71|492.94|542.12

Average per tweet: 0.54212 ms
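
The per-tweet average follows from dividing the elapsed time by the number of tweets. As a quick check, here is a minimal sketch of that arithmetic using the 1M column. The class and method names are illustrative only and are not part of the library.

```java
class BenchmarkArithmetic {
    // Average milliseconds per item, given total elapsed seconds and item count.
    static double msPerItem(double totalSeconds, long items) {
        return totalSeconds * 1000.0 / items;
    }

    // Items processed per second over the whole run.
    static double itemsPerSecond(double totalSeconds, long items) {
        return items / totalSeconds;
    }

    public static void main(String[] args) {
        double seconds = 542.12;   // "Time in Seconds" for the 1M column
        long tweets = 1_000_000L;  // number of tweets in that column
        System.out.println(msPerItem(seconds, tweets) + " ms per tweet");
        System.out.println(itemsPerSecond(seconds, tweets) + " tweets per second");
    }
}
```

The same calculation on the 100K column gives about 0.576 ms per tweet, so the cost per tweet stays roughly constant as the input grows.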

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.2/morph/

Author(s)

License

Copyright 2014 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.