Language Classification
Suppose that you run a company receiving enquiries by e-mail in several languages and you would like them to be automatically sorted and forwarded to the people who can handle them. Suppose that you want to restrict your searches to documents in a certain language. Suppose that you want to apply some dictionary-based algorithms to a text written in a language not known in advance. Then you might find a tool for language classification useful.
On this page you will find a simple tool that determines the language of a text that you type or paste into the area below. If your browser is JavaScript-enabled, you can also select samples in Italian, German, or French. The tool returns the probability that the text is in English, Italian, German, or French. These languages form the knowledge base of the classification algorithm.
| Deutsch | English | Français | Italiano |
Note that:
The classification algorithm is not based on a dictionary. This means that it is rather robust to misspellings, neologisms and invented words. A sample taken from the Jabberwocky poem by Lewis Carroll is correctly classified as English. These are invented words all right, but they are English invented words.
The algorithm assumes that the text is in one of the languages in the knowledge base. If the text you type in is in a different language, the result is not meaningful. It is interesting to note, though, that if you type in a Spanish text, it will have a higher probability of being classified as Italian than as English, in accordance with the closer relationship between the two Romance languages.
The languages in the knowledge base were chosen only because these are the languages that I am most familiar with. Any language can be added at will, including languages using a different script.
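The points above can be illustrated with a small sketch. This page does not say which algorithm the tool uses, so the following is only one plausible dictionary-free approach, assumed for illustration: a naive Bayes classifier over character trigrams. The function names (`trigrams`, `train`, `classify`), the smoothing constant `alpha`, and the tiny training sentences in `SAMPLES` are all hypothetical; a real knowledge base would be built from far larger texts.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny training texts (one per language).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog while the children are playing in the garden",
    "Italiano": "la volpe veloce salta sopra il cane pigro mentre i bambini giocano nel giardino della casa",
    "Deutsch": "der schnelle braune fuchs springt über den faulen hund während die kinder im garten spielen",
    "Français": "le renard brun rapide saute par-dessus le chien paresseux pendant que les enfants jouent dans le jardin",
}

def trigrams(text):
    """Return the overlapping character trigrams of a lowercased, space-padded text."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Build one trigram count table per language from {language: training text}."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

def classify(text, models, alpha=0.5):
    """Return {language: probability}, assuming the text is in one of the known languages."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    size = len(vocab) + 1  # one extra slot stands in for unseen trigrams

    log_scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        # Additive smoothing keeps unseen trigrams from zeroing out a language.
        log_scores[lang] = sum(
            math.log((counts[g] + alpha) / (total + alpha * size))
            for g in trigrams(text)
        )

    # Normalize over the known languages only (uniform prior, log-sum-exp for stability).
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

models = train(SAMPLES)
for lang, p in classify("the children are playing", models).items():
    print(f"{lang}: {p:.3f}")
```

In this sketch the probabilities are normalized over the known languages, so they always sum to one. That is why a text in an unknown language still receives a confident-looking answer, and why a misspelled or invented word can still score well as long as its letter sequences resemble those of one language. Adding a language amounts to adding one more entry to `SAMPLES`; because the model works on characters rather than words, the entry can use a different script.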