Name: twitter-korean-text
Owner: Twitter, Inc.
Description: Korean tokenizer
Created: 2014-10-29 21:16:33.0
Updated: 2018-01-10 09:19:27.0
Pushed: 2017-06-26 07:12:52.0
Homepage: null
Size: 28931
Language: Scala
Open-source Korean text processor by Twitter
A Scala library for processing Korean text, with a Java wrapper. twitter-korean-text currently provides Korean normalization and tokenization. Please join our community at Google Forum. The text processor is not limited to short tweets; it is intended for longer text as well.
The goal of twitter-korean-text is to extract index terms from large text collections through lightweight Korean processing. It does not aim to be a complete morphological analyzer.
twitter-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.
Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)
Tokenization
Stemming (입니다 -> 이다)
Phrase extraction
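To make the normalization feature above more concrete, here is a minimal Java sketch of one kind of cleanup a normalizer might perform: collapsing a long run of a repeated character down to two copies. This is an illustration only and is not the library's algorithm; the class name RepeatCollapser and its collapse method are hypothetical and do not exist in twitter-korean-text.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT part of twitter-korean-text.
final class RepeatCollapser {
    // Matches one character followed by two or more copies of itself,
    // i.e. a run of three or more identical characters.
    private static final Pattern LONG_RUN =
            Pattern.compile("(.)\\1{2,}", Pattern.DOTALL);

    // Replaces every run of 3+ identical characters with exactly two copies.
    static String collapse(String text) {
        return LONG_RUN.matcher(text).replaceAll("$1$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("ㅋㅋㅋㅋㅋ"));
        System.out.println(collapse("soooooo good"));
    }
}
```

Runs of exactly two characters are left untouched, so ordinary doubled letters survive. The real normalizer also rewrites colloquial spellings, which a simple regular expression like this does not attempt.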
Introductory Presentation: Google Slides
Gunja Agrawal kindly created a test API webpage for this project: http://gunjaagrawal.com/langhack/
Opensourced here: twitter-korean-tokenizer-api
To include this in your Maven-based JVM project, add the following lines to your pom.xml:
<dependency>
  <groupId>com.twitter.penguin</groupId>
  <artifactId>korean-text</artifactId>
  <version>4.4</version>
</dependency>
The Maven site is available at http://twitter.github.io/twitter-korean-text/ and the Scaladocs are at http://twitter.github.io/twitter-korean-text/scaladocs/
modamoda kindly offered a .net wrapper: https://github.com/modamoda/TwitterKoreanProcessorCS
Ch0p kindly offered a node.js wrapper: twtkrjs
Youngrok Kim kindly offered a node.js wrapper: node-twitter-korean-text
Baeg-il Kim kindly offered a Python version: https://github.com/cedar101/twitter-korean-py
Jaepil Jeong kindly offered a Python wrapper: https://github.com/jaepil/twkorean
jun85664396 kindly offered a Ruby wrapper: twitter-korean-text-ruby
Jaehyun Shin kindly offered a Ruby wrapper: twitter-korean-text-ruby
socurites's Korean analyzer for elasticsearch based on twitter-korean-text: tkt-elasticsearch
Clone the Git repo and build using Maven.
git clone https://github.com/twitter/twitter-korean-text.git
cd twitter-korean-text
mvn compile
Open 'pom.xml' in your favorite IDE.
You can find these examples in the examples folder.
from Scala
import com.twitter.penguin.korean.TwitterKoreanProcessor
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken

object ScalaTwitterKoreanTextExample {
  def main(args: Array[String]): Unit = {
    val text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어"

    // Normalize
    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)
    println(normalized)
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)
    println(tokens)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Stemming
    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)
    println(stemmed)
    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4))

    // Phrase extraction
    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)
    println(phrases)
    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))
  }
}
from Java
import java.util.List;

import scala.collection.Seq;

import com.twitter.penguin.korean.TwitterKoreanProcessor;
import com.twitter.penguin.korean.TwitterKoreanProcessorJava;
import com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;
import com.twitter.penguin.korean.tokenizer.KoreanTokenizer;

public class JavaTwitterKoreanTextExample {
  public static void main(String[] args) {
    String text = "한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어";

    // Normalize
    CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);
    System.out.println(normalized);
    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어

    // Tokenize
    Seq<KoreanTokenizer.KoreanToken> tokens = TwitterKoreanProcessorJava.tokenize(normalized);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));
    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Stemming
    Seq<KoreanTokenizer.KoreanToken> stemmed = TwitterKoreanProcessorJava.stem(tokens);
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));
    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]
    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));
    // [한국어(Noun: 0, 3), 를(Josa: 3, 1), (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2), (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2), (Space: 17, 1), #한국어(Hashtag: 18, 4)]

    // Phrase extraction
    List<KoreanPhraseExtractor.KoreanPhrase> phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);
    System.out.println(phrases);
    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]
  }
}
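Each token and phrase in the printed output carries two numbers, such as (Noun: 5, 2). The sample output is consistent with reading these as a start offset and a length, counted in characters of the normalized string. Under that assumption, the following Java sketch shows how a span maps back to its surface text. TokenSpan is a hypothetical stand-in written for this illustration; it is not the library's KoreanToken class, and the sample sentence is only an assumed normalized input.

```java
// Hypothetical stand-in for a token's position; NOT the library's KoreanToken.
final class TokenSpan {
    final int offset; // start index into the normalized text
    final int length; // number of chars covered by the token

    TokenSpan(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    // Returns the slice of the text that this span covers.
    String surface(String text) {
        return text.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        // Assumed normalized sample sentence (22 chars long).
        String normalized = "한국어를 처리하는 예시입니다ㅋㅋ #한국어";
        System.out.println(new TokenSpan(0, 3).surface(normalized));  // 한국어
        System.out.println(new TokenSpan(5, 2).surface(normalized));  // 처리
        System.out.println(new TokenSpan(18, 4).surface(normalized)); // #한국어
    }
}
```

Because the offsets refer to the normalized string, slicing the raw input instead would generally give the wrong text whenever normalization changed its length.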
TwitterKoreanProcessor.scala is the central object that provides the interface for all the features.
mvn test runs our unit tests.
We provide tools for quality assurance and test resources. They can be found under src/main/scala/com/twitter/penguin/korean/qa and src/main/scala/com/twitter/penguin/korean/tools.
Refer to the general contribution guide. We will add this project-specific contribution guide later.
Tested on an Intel i7 2.3 GHz
Initial loading time: 2~4 sec
Average time to parse a chunk: 0.12 ms
Tweets (Avg length ~50 chars)
Tweets | 100K | 200K | 300K | 400K | 500K | 600K | 700K | 800K | 900K | 1M
---|---|---|---|---|---|---|---|---|---|---
Time in Seconds | 57.59 | 112.09 | 165.05 | 218.11 | 270.54 | 328.52 | 381.09 | 439.71 | 492.94 | 542.12
Average per tweet: 0.54212 ms
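The per-tweet average is simply the total time divided by the number of tweets, converted to milliseconds: 542.12 s / 1,000,000 = 0.54212 ms. The Java sketch below repeats that arithmetic for every column of the table. BenchmarkMath and its methods are hypothetical names used only for this illustration; the figures are copied from the table.

```java
// Hypothetical helper for illustration; figures are copied from the table above.
final class BenchmarkMath {
    // Average milliseconds per item, given total seconds and item count.
    static double avgMillis(double totalSeconds, long items) {
        return totalSeconds * 1000.0 / items;
    }

    // Items processed per second.
    static double throughput(double totalSeconds, long items) {
        return items / totalSeconds;
    }

    public static void main(String[] args) {
        long[] tweets = {100_000, 200_000, 300_000, 400_000, 500_000,
                         600_000, 700_000, 800_000, 900_000, 1_000_000};
        double[] seconds = {57.59, 112.09, 165.05, 218.11, 270.54,
                            328.52, 381.09, 439.71, 492.94, 542.12};
        for (int i = 0; i < tweets.length; i++) {
            System.out.printf("%,d tweets: %.4f ms/tweet, %.0f tweets/s%n",
                    tweets[i],
                    avgMillis(seconds[i], tweets[i]),
                    throughput(seconds[i], tweets[i]));
        }
    }
}
```

Across the ten columns the average stays between roughly 0.54 and 0.58 ms per tweet, which suggests the processing time grows close to linearly with the number of tweets.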
Benchmark test by KoNLPy
From http://konlpy.org/ko/v0.4.2/morph/
Copyright 2014 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0