Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?
MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.
To get started, you register, verify your email, and sign in to get your "license key". Once you have that, you simply use it like this:
>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload = {'key': 'b49....................ee',
...            'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
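In a script, you'll probably want a small helper that checks the response status and pulls out the top guess rather than eyeballing the JSON. A minimal sketch, assuming the response shape shown above (the function name best_language is my own, not part of the API):

```python
def best_language(response_json):
    """Return the most likely language code from a MeaningCloud
    lang-1.1 response, or None if the call failed or nothing matched."""
    status = response_json.get('status', {})
    if status.get('code') != '0':  # '0' means OK in the responses above
        return None
    langs = response_json.get('lang_list', [])
    return langs[0] if langs else None

# Using the Swedish response shown above:
sample = {'status': {'remaining_credits': '39999', 'credits': '1',
                     'msg': 'OK', 'code': '0'},
          'lang_list': ['sv', 'da', 'no', 'es']}
print(best_language(sample))  # -> sv
```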
If you look at the lang_list list, the first one is sv, for Swedish.
If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.
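If you only need a handful of codes, a tiny hard-coded mapping is enough. This sketch covers just the codes that come up in this post; for full coverage you'd want the complete ISO 639-1 table or a library like pycountry:

```python
# Partial ISO 639-1 mapping -- only the codes that appear in this post.
ISO_639_1 = {
    'sv': 'Swedish',
    'da': 'Danish',
    'no': 'Norwegian',
    'es': 'Spanish',
    'pt': 'Portuguese',
    'ro': 'Romanian',
    'sw': 'Swahili',
}

def language_name(code):
    """Full English name for a two-letter code, or the code itself
    if it's not in the (partial) table."""
    return ISO_639_1.get(code, code)

print(language_name('sv'))  # -> Swedish
```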
Let's do the other ones too:
>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portuguese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}
The service isn't perfect. It struggles with shorter texts in non-Western alphabets. But it's easy to use and delivers pretty good results.
UPDATE
Note! If you intend to do this in bulk and have access to Python and NLTK, use this script instead.
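The NLTK approach typically boils down to comparing the words in the text against each language's stopword list and picking the language with the biggest overlap. Here's a self-contained sketch of that idea, with tiny hand-picked word sets standing in for NLTK's much longer nltk.corpus.stopwords lists (the word sets below are my own abbreviations, not NLTK's):

```python
# Tiny stand-in stopword sets; NLTK's real lists are much longer.
STOPWORDS = {
    'english': {'the', 'a', 'over', 'and', 'of', 'is'},
    'swedish': {'den', 'över', 'och', 'att', 'en', 'som'},
    'portuguese': {'o', 'a', 'de', 'que', 'e', 'do'},
}

def guess_by_stopwords(text):
    """Guess the language by counting how many of the text's words
    appear in each language's stopword set."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_by_stopwords('the quick brown fox jumps over the lazy dog'))
# -> english
```

With only a few stopwords per language this is crude, but with NLTK's full lists the same idea works surprisingly well on longer texts.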
I tried it on my NLTK install, and it can detect 14 languages.
UPDATE 2
A much better solution than NLTK is guess_language-spirit. It's super fast, and when I spot-checked a bunch of its outputs by putting the non-English text into Google Translate, it almost always got it right.