How to identify/classify what language a piece of text is

09 August 2016   0 comments   Misc. links, Python

Mind that age!

This blog post is 4 years old! Most likely, its content is outdated. Especially if it's technical.

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sig in to get your "license key". Now when you have that you simply use it like this:

>>> import requests
>>> url = ''
>>> payload={'key': '',
... 'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}

If you look at the lang_list list, the first one is sv for Swedish.

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portugese
>>>, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>>, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles on shorter texts using non-western alphabet. But it's pretty easy to use and delivers pretty good results.


Note! If you intend to do this in bulk and you have access to Python and NLTK use this script instead.

I tried it on my nltk install and I have 14 languages that it can detect.


A much better solution than NLTK is guess_language-spirit. It's superfast and I spotchecked a bunch of its outputs and put the non-English text into Google Translate and a it almost always gets it right.


Your email will never ever be published

Related posts