Name: cld2
Owner: rOpenSci
Description: R Wrapper for Google's Compact Language Detector 2
Created: 2017-06-02 21:56:32.0
Updated: 2017-10-10 08:41:11.0
Pushed: 2017-10-15 14:16:57.0
Homepage: null
Size: 30529
Language: C++
CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g. 80% English and 20% French out of 1000 bytes).
This package includes a bundled version of libcld2:
devtools::install_github("ropensci/cld2")
The function detect_language() returns the best guess, or NA if the language could not be reliably determined.
cld2::detect_language("To be or not to be")
# [1] "ENGLISH"
cld2::detect_language("Ce n'est pas grave.")
# [1] "FRENCH"
cld2::detect_language("Nou breekt mijn klomp!")
# [1] "DUTCH"
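When classifying several strings, it can help to handle the NA result explicitly. The sketch below assumes cld2 is installed and calls the detector once per string, so it does not rely on detect_language() accepting a vector. The helper name guess_languages and the "UNKNOWN" fallback label are made up for illustration and are not part of the package.

```r
# Hypothetical helper (not part of cld2): classify each string separately
# and replace unreliable (NA) guesses with a fallback label.
guess_languages <- function(texts, detector, fallback = "UNKNOWN") {
  guesses <- vapply(
    texts,
    function(x) as.character(detector(x))[1],
    character(1),
    USE.NAMES = FALSE
  )
  guesses[is.na(guesses)] <- fallback
  guesses
}

texts <- c("To be or not to be", "Ce n'est pas grave.", "Nou breekt mijn klomp!")

# Only call into cld2 when the package is available.
if (requireNamespace("cld2", quietly = TRUE)) {
  print(guess_languages(texts, cld2::detect_language))
}
```

Passing the detector as an argument keeps the helper easy to test with a stub function and leaves the choice of plain_text to the caller.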
Set plain_text = FALSE if your input contains HTML:
cld2::detect_language(url('http://www.un.org/ar/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "ARABIC"
cld2::detect_language(url('http://www.un.org/zh/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "CHINESE"
Use detect_language_multi() to get detailed classification output.
cld2::detect_language_multi(url('http://www.un.org/fr/universal-declaration-human-rights/'), plain_text = FALSE)
# $classification
#   language code latin proportion
# 1   FRENCH   fr  TRUE       0.96
# 2  ENGLISH   en  TRUE       0.03
# 3   ARABIC   ar FALSE       0.00
# $bytes
# [1] 17008
# $reliabale
# [1] TRUE
This shows the top 3 language guesses and the proportion of the text that was classified as each language.
The bytes attribute shows the total number of text bytes that were classified. The reliability flag (spelled reliabale in the output) is the result of a more complex calculation of whether the top-ranked language is sufficiently more probable than the second-best language.
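Because proportion is a share of the classified bytes, multiplying it by bytes gives a rough per-language byte count. The following sketch works on a hand-built list that mirrors the output shown above rather than on a live call, so it runs without network access. The column name approx_bytes is made up for illustration.

```r
# Hand-built stand-in for the result shown above (values copied from the
# printed output; not produced by a live call).
res <- list(
  classification = data.frame(
    language = c("FRENCH", "ENGLISH", "ARABIC"),
    code = c("fr", "en", "ar"),
    latin = c(TRUE, TRUE, FALSE),
    proportion = c(0.96, 0.03, 0.00),
    stringsAsFactors = FALSE
  ),
  bytes = 17008
)

# Approximate number of bytes attributed to each language.
cls <- res$classification
cls$approx_bytes <- round(cls$proportion * res$bytes)

# Keep only languages that account for a non-zero share of the text.
detected <- cls[cls$proportion > 0, c("language", "approx_bytes")]
print(detected)
```

The proportions are rounded, so the per-language counts are estimates and need not add up exactly to bytes.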