ropensci/cld2

Name: cld2

Owner: rOpenSci

Description: R Wrapper for Google's Compact Language Detector 2

Created: 2017-06-02 21:56:32.0

Updated: 2017-10-10 08:41:11.0

Pushed: 2017-10-15 14:16:57.0

Homepage: null

Size: 30529

Language: C++

README

cld2

R Wrapper for Google's Compact Language Detector 2

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g. 80% English and 20% French out of 1000 bytes).

Installation

This package includes a bundled version of libcld2, so it can be installed directly from GitHub with devtools:

devtools::install_github("ropensci/cld2")

Guess a Language

The function detect_language() returns the best guess, or NA if the language could not be reliably determined.

cld2::detect_language("To be or not to be")
# [1] "ENGLISH"

cld2::detect_language("Ce n'est pas grave.")
# [1] "FRENCH"

cld2::detect_language("Nou breekt mijn klomp!")
# [1] "DUTCH"
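
As a minimal sketch of handling the NA case described above, the snippet below classifies several strings in one call and replaces undetermined results with an explicit label. It assumes that the cld2 package is installed and that detect_language() accepts a character vector and returns one guess per element; the label "UNKNOWN" is a placeholder chosen for illustration.

```r
# Sketch only: assumes cld2 is installed and that detect_language()
# is vectorised over its text argument.
texts <- c(
  "To be or not to be",
  "Ce n'est pas grave.",
  "Nou breekt mijn klomp!"
)

langs <- cld2::detect_language(texts)

# NA marks inputs whose language could not be reliably determined,
# so substitute an explicit (hypothetical) label before tabulating.
langs[is.na(langs)] <- "UNKNOWN"

# Count how many inputs fell into each language.
print(table(langs))
```

Replacing NA before tabulating keeps undetermined inputs visible in the counts instead of silently dropping them.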

Set plain_text = FALSE if your input contains HTML:

cld2::detect_language(url('http://www.un.org/ar/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "ARABIC"

cld2::detect_language(url('http://www.un.org/zh/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "CHINESE"

Use detect_language_multi() to get detailed classification output.

cld2::detect_language_multi(url('http://www.un.org/fr/universal-declaration-human-rights/'), plain_text = FALSE)
# $classificaton
#   language code latin proportion
# 1   FRENCH   fr  TRUE       0.96
# 2  ENGLISH   en  TRUE       0.03
# 3   ARABIC   ar FALSE       0.00
#
# $bytes
# [1] 17008
#
# $reliabale
# [1] TRUE

This shows the top three language guesses and the proportion of text classified as each language. The bytes element gives the total number of text bytes that were classified, and the final logical element reports whether the result is reliable, which CLD2 derives from a complex calculation of whether the top language is sufficiently more probable than the second-best language.
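
The structure described above can be explored with a short sketch like the following. It assumes cld2 is installed, uses a made-up mixed-language string instead of a URL so that it runs offline, and accesses the classification table by position (the first element, as in the printed output above) rather than by name. The column names language and proportion are taken from that printed output.

```r
# Sketch only: assumes cld2 is installed; the sample text is illustrative.
text <- paste(
  "To be or not to be, that is the question.",
  "Ce n'est pas grave, on verra bien demain matin."
)

res <- cld2::detect_language_multi(text)

# The first element is the classification table (one row per guess).
classification <- res[[1]]

# Keep only guesses that account for a non-zero share of the text;
# which() drops rows where the comparison yields NA.
keep <- which(classification$proportion > 0)
detected <- classification[keep, c("language", "proportion")]

print(detected)
```

Indexing by position avoids depending on the exact spelling of the element names, which is worth checking with names(res) on the installed version.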


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.