How to identify/classify what language a piece of text is

Tuesday, Aug 9, 2016
0 comments Misc. links, Python

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sign in to get your "license key". Once you have that, you simply use it like this:

>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload={'key': 'b49....................ee',
... 'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
>>>

If you look at the lang_list list, the first one is sv for Swedish.
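Picking out that first entry can be wrapped in a small helper. This is a minimal sketch that only assumes the response shape shown in the output above; the function name top_language is made up for illustration, and treating a status code of '0' as the only success value is an assumption based on that one sample, not on the API documentation.

```python
def top_language(response):
    """Return the first (most likely) language code from a response dict.

    Assumes the shape shown in the sample output: a 'status' dict with
    'code' and 'msg', plus a 'lang_list' ordered from best to worst guess.
    """
    status = response.get('status', {})
    # Assumption: '0' means success, as in the sample where msg is 'OK'.
    if status.get('code') != '0':
        raise ValueError('API error: {}'.format(status.get('msg')))
    lang_list = response.get('lang_list', [])
    # An empty list means the service had no guess at all.
    return lang_list[0] if lang_list else None


# The response from the Swedish example, copied from the session above.
sample = {
    'status': {'remaining_credits': '39999', 'credits': '1',
               'msg': 'OK', 'code': '0'},
    'lang_list': ['sv', 'da', 'no', 'es'],
}
print(top_language(sample))
```

You would pass it the result of requests.post(url, data=payload).json().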

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.
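If you only care about a handful of languages, a small lookup table saves the trip to the ISO table. A minimal sketch covering just the codes that appear in this post; the names LANGUAGE_NAMES and language_name are made up for illustration.

```python
# ISO 639-1 codes seen in the responses in this post, mapped to English names.
LANGUAGE_NAMES = {
    'sv': 'Swedish',
    'da': 'Danish',
    'no': 'Norwegian',
    'es': 'Spanish',
    'pt': 'Portuguese',
    'ro': 'Romanian',
    'sw': 'Swahili',
}


def language_name(code):
    """Return the English name for a code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code, code)


print([language_name(code) for code in ['sv', 'da', 'no', 'es']])
```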

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portuguese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles with shorter texts written in non-Western alphabets. But it's pretty easy to use and delivers pretty good results.

UPDATE

Note! If you intend to do this in bulk and you have access to Python and NLTK, use this script instead.

I tried it on my NLTK install and it can detect 14 languages.
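The script itself isn't reproduced here. One common way to do this kind of detection offline is to count how many of each language's stopwords appear in the text and pick the language with the most hits. Whether the script works exactly like that is an assumption on my part. The sketch below illustrates the idea with tiny hand-written word lists (not NLTK's data), so it runs without NLTK installed; the names STOPWORDS and guess_by_stopwords are made up for illustration.

```python
# Tiny, hand-picked stopword samples for illustration only. A real setup
# would use much larger lists, for example the ones in a stopwords corpus.
STOPWORDS = {
    'english': {'the', 'and', 'over', 'is', 'of', 'to', 'a'},
    'swedish': {'den', 'och', 'över', 'är', 'det', 'en', 'att'},
    'portuguese': {'o', 'a', 'e', 'de', 'que', 'do', 'da'},
}


def guess_by_stopwords(text):
    """Return the language whose stopwords overlap most with the text.

    Returns None when no stopword from any language appears in the text.
    """
    words = set(text.lower().split())
    # Score each language by the number of distinct stopwords found.
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_stopwords('Den snabba bruna räven hoppar över den lata hunden'))
print(guess_by_stopwords('A ligeira raposa marrom ataca o cão preguiçoso'))
```

With lists this small it is easy to fool, and very short texts may contain no stopwords at all.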

UPDATE 2

A much better solution than NLTK is guess_language-spirit. It's superfast. I spot-checked a bunch of its outputs by putting the non-English text into Google Translate, and it almost always gets it right.
