How to identify/classify what language a piece of text is

09 August 2016   Misc. links, Python

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sign in to get your "license key". Once you have that, you simply use it like this:

>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload = {'key': 'b49....................ee',
...            'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
>>>

If you look at lang_list, the first entry is sv, the code for Swedish.
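If you'd rather not eyeball the dict every time, here is a minimal sketch of a helper that pulls out the first code. It assumes, based only on the responses shown in this post, that success is reported as the string '0' in status.code and that the first entry in lang_list is the best guess. The name best_language is made up for this example.

```python
def best_language(response):
    """Return the first language code from a parsed API response, or None."""
    status = response.get('status', {})
    # Assumption: the sample responses in this post report success as '0'.
    if status.get('code') != '0':
        raise ValueError('API error: %s' % status.get('msg', 'unknown'))
    langs = response.get('lang_list') or []
    # Assumption: the first entry is the most likely language.
    return langs[0] if langs else None


# The response from the Swedish example, pasted in as a plain dict.
sample = {
    'status': {'remaining_credits': '39999', 'credits': '1',
               'msg': 'OK', 'code': '0'},
    'lang_list': ['sv', 'da', 'no', 'es'],
}
print(best_language(sample))
```

You'd call it as best_language(requests.post(url, data=payload).json()).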

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.
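For the handful of codes that show up in this post, a small lookup dict does the job. This is only a sketch: the dict covers just these seven codes, and language_name is a made-up helper name, so extend it from the full ISO 639-1 table as needed.

```python
# ISO 639-1 codes seen in the examples in this post, with their English names.
LANGUAGE_NAMES = {
    'sv': 'Swedish',
    'da': 'Danish',
    'no': 'Norwegian',
    'es': 'Spanish',
    'pt': 'Portuguese',
    'ro': 'Romanian',
    'sw': 'Swahili',
}


def language_name(code):
    """Return the English name for a code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code, code)


print(language_name('sv'))
```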

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portuguese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles with shorter texts in non-Western alphabets. But it's pretty easy to use and delivers pretty good results.

UPDATE

Note! If you intend to do this in bulk and you have access to Python and NLTK, use this script instead.

I tried it on my NLTK install, and it can detect 14 languages.
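The script itself isn't reproduced here. One common way to do this with NLTK is to count how many of each language's stopwords appear in the text, and the sketch below assumes that approach. To keep it runnable without NLTK, it uses tiny hand-picked stopword sets as a stand-in for the real corpus, and guess_by_stopwords is a made-up name.

```python
# Tiny hand-picked stopword sets, for illustration only. With NLTK and its
# stopwords corpus installed, you could build the same mapping from
# nltk.corpus.stopwords.words(lang) for each lang in stopwords.fileids().
STOPWORDS = {
    'english': {'the', 'and', 'over', 'of', 'to', 'is'},
    'swedish': {'den', 'och', 'över', 'att', 'det', 'som'},
    'portuguese': {'o', 'a', 'e', 'de', 'que', 'do'},
}


def guess_by_stopwords(text, stopwords=STOPWORDS):
    """Return the language whose stopwords overlap the text most, or None."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in stopwords.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means we have nothing to go on.
    return best if scores[best] > 0 else None


print(guess_by_stopwords('Den snabba bruna räven hoppar över den lata hunden'))
```

With stopword sets this small, ties and misses are easy to hit, so treat the result as a rough guess.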

UPDATE 2

A much better solution than NLTK is guess_language-spirit. It's superfast. I spot-checked a bunch of its outputs by putting the non-English text into Google Translate, and it almost always gets it right.
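Here's a minimal sketch of how you might call it, assuming the package is installed (pip install guess_language-spirit) and exposes guess_language() from the guess_language module, as its README shows. The wrapper name guess is made up, and it returns None when the package isn't available so the snippet runs either way.

```python
def guess(text):
    """Return guess_language's answer for text, or None if it isn't installed."""
    try:
        # Assumption: provided by the guess_language-spirit package.
        from guess_language import guess_language
    except ImportError:
        return None
    # Returns a language code string; very short or ambiguous text
    # may come back as 'UNKNOWN'.
    return guess_language(text)


print(guess('Den snabba bruna räven hoppar över den lata hunden'))
```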
