Why do short texts ("hello") almost always fail?

Because **trigram statistics need data points**. A five-character word gives you three trigrams. A ten-character message gives you eight. With that few samples you cannot tell English from Dutch: both have "the" / "het" patterns and a lot of shared roots. The library returns the best guess it can, but the **confidence is meaningless** at that length. Realistic rule of thumb: **30 characters is the bare minimum**, **100 characters is solid**, **500+ characters is essentially bulletproof** for any well-supported language.
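
To make the arithmetic concrete, here is a minimal sketch of sliding-window trigram extraction. It is a simplification written for this page (no padding, no normalisation), not franc-min's own code, and the function name `trigrams` is made up for the example.

```javascript
// Simplified sliding-window trigram extraction.
// Assumption: no padding or character normalisation, unlike a real detector.
function trigrams(text) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const result = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    result.push(chars.slice(i, i + 3).join(''));
  }
  return result;
}

console.log(trigrams('hello'));             // [ 'hel', 'ell', 'llo' ]
console.log(trigrams('hello this').length); // 8
```

A string of n characters yields n - 2 trigrams under this scheme, which is why a single word carries almost no statistical signal.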

What is the difference between ISO 639-1 and ISO 639-3?

Both are language code standards from the **International Organization for Standardization**, but they cover different sets. **ISO 639-1** uses two letters ("en", "pl", "de") and covers only the most widely used languages, around 184. **ISO 639-3** uses three letters ("eng", "pol", "deu") and covers virtually every language on earth, more than 7,000 entries. franc-min returns ISO 639-3 because it supports languages that have no 639-1 code (such as many regional or minority languages). We show both when a 639-1 code exists so you can copy whichever your downstream system expects.
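
As an illustration of showing both codes, the sketch below maps a three-letter code to its two-letter counterpart when one exists. The table is a tiny hand-written excerpt and the names `ISO3_TO_ISO1` and `describeCode` are hypothetical; a real table would be generated from the ISO 639 code lists.

```javascript
// Tiny hand-written excerpt for illustration only.
const ISO3_TO_ISO1 = new Map([
  ['eng', 'en'],
  ['pol', 'pl'],
  ['deu', 'de'],
]);

// Returns both codes; iso1 is null when the language has no two-letter code.
function describeCode(iso3) {
  const iso1 = ISO3_TO_ISO1.get(iso3) ?? null;
  return { iso3, iso1 };
}

console.log(describeCode('pol')); // { iso3: 'pol', iso1: 'pl' }
console.log(describeCode('ceb')); // { iso3: 'ceb', iso1: null }
```

Cebuano (`ceb`) is one example of a language that has a three-letter code but no two-letter one.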

What happens if the text is mixed (two languages in one block)?

The detector picks the **dominant language**, the one with the most matching trigrams. If you paste an English email with a single Polish quote, you get English back. If you paste a near-50/50 mix, the top-5 list will show both languages with similar confidence; that is the signal that the input is mixed. The tool **cannot** split a document into per-paragraph languages. That requires a more sophisticated segmentation step, which franc-min does not do.

Why does it confuse Czech and Slovak (or Norwegian and Danish)?

Because these languages **share most of their trigram space**. Czech and Slovak have nearly identical phonotactics, very similar root vocabulary, and overlapping function words. From the detector's point of view they look like dialects of the same language. The same is true of **Norwegian Bokmål vs Danish** (the written forms are extremely close), **Serbian vs Croatian vs Bosnian**, **Indonesian vs Malay**, and to a lesser extent **Spanish vs Portuguese**. When the gap between the top-1 and top-2 candidates is small, treat the result as "one of these two", not as a single answer.
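
The "one of these two" rule can be sketched as a small helper. It assumes the candidates arrive as `[code, score]` pairs sorted best-first with scores between 0 and 1, which is how we understand franc-min's `francAll` to report results; the function name `summarize` and the 0.05 gap are arbitrary choices for this example.

```javascript
// candidates: array of [iso3, score] pairs, best first, scores in 0..1 (assumed shape).
// minGap is an arbitrary illustrative threshold, not a library default.
function summarize(candidates, minGap = 0.05) {
  if (candidates.length === 0) return { verdict: 'none', languages: [] };
  const [first, second] = candidates;
  if (second && first[1] - second[1] < minGap) {
    return { verdict: 'ambiguous', languages: [first[0], second[0]] };
  }
  return { verdict: 'single', languages: [first[0]] };
}

console.log(summarize([['ces', 1], ['slk', 0.98]]));
// { verdict: 'ambiguous', languages: [ 'ces', 'slk' ] }
```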

How accurate is this in practice?

For **well-supported languages on inputs over 100 characters**, accuracy is typically **above 95%**. For inputs between 30 and 100 characters it drops to **85-92%** depending on the language. Below 30 characters it falls off a cliff into **50-70%** territory. Short tweets, single-word queries, file names, and code snippets are notoriously hard. Long, natural prose in a major language is essentially always correct. franc is a widely used open-source library, so it has been exercised on a lot of real text.

Why does my technical or code-mixed writing get the wrong language?

Because technical writing in any language tends to **borrow heavily from English** (variable names, function names, API endpoints, error messages). A Polish blog post about React with code snippets, command outputs and English error messages might score higher for English than for Polish, even when the prose is clearly Polish. The detector is honest about what it sees: the trigram distribution really does lean English in that case. If you want to detect the **author's language**, strip code and English borrowings first, or rely on a longer prose-only paragraph.
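
One way to strip the obvious noise before detection is a few regular expressions. This is a rough sketch that assumes Markdown-style input (fenced blocks, inline code spans, bare URLs); the function name `stripTechnicalNoise` is hypothetical, and English loanwords inside the prose itself are not handled.

```javascript
// The fence marker is built at runtime so this example contains no literal fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');

// Removes fenced code, inline code spans and URLs, then collapses whitespace.
function stripTechnicalNoise(text) {
  return text
    .replace(FENCED_BLOCK, ' ')        // fenced code blocks (must run first)
    .replace(/`[^`\n]*`/g, ' ')        // inline code spans
    .replace(/https?:\/\/\S+/g, ' ')   // bare URLs
    .replace(/\s+/g, ' ')
    .trim();
}

const post =
  'Ten wpis opisuje hook `useEffect` w React.\n' +
  FENCE + 'js\nconsole.log("error");\n' + FENCE +
  '\nWięcej na https://example.com/docs teraz.';

console.log(stripTechnicalNoise(post));
// Ten wpis opisuje hook w React. Więcej na teraz.
```

Run the detector on the cleaned string rather than the raw post; keep in mind that stripping also shortens the input, so very code-heavy posts may fall below a usable length.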

Is my text sent to any external service?

**No, never**. franc-min is a self-contained Node library; the reference profiles are part of the package. Our API route runs the detection **in the same process** that serves the page, then returns the result. We do not log the text, we do not store it, we do not forward it anywhere. The full pipeline is "browser -> our server -> franc-min -> back". Compare that with cloud language APIs, which would send your input to Google or Azure for billing and analytics.

Which languages are supported?

franc-min recognises **about 82 of the most widely spoken languages**, selected by number of speakers. That covers English, Spanish, Mandarin, Hindi, Arabic, Portuguese, Bengali, Russian, Japanese, Punjabi, German, Korean, French, Vietnamese, Turkish, Italian, Polish, Ukrainian, Persian, Romanian, Dutch, Hungarian, Greek, Czech, Swedish, Bulgarian, Danish, Finnish, Slovak, Croatian, Serbian, Bosnian, Slovenian, Norwegian, Hebrew, Thai, Indonesian, Malay, Tagalog, Swahili, Zulu, Afrikaans, Amharic, Hausa, Yoruba, Igbo, Somali, Georgian, Armenian, Azerbaijani, Kazakh, Uzbek, Mongolian, Nepali, Sinhala, Burmese, Khmer, Lao, and many more. The full reference profile is available in the franc-min repository.

Can I trust the confidence percentage?

It is a **relative score**, not a probability. A confidence of 100% means "this language was the best match by a clear margin"; a confidence of 50% means "the best match was barely better than the next one". The detector **always** returns its best guess, even on garbage input, so a low confidence is your warning that the result is shaky. The honest interpretation: above **85%** treat it as reliable, **50-85%** sanity-check by looking at the top-2 candidate, below **50%** assume the input is too short or too noisy to detect cleanly.
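
If you consume the percentage in your own code, the three bands above translate into a small helper. The function name and the returned labels are made up for this sketch; only the 85 and 50 thresholds come from the text.

```javascript
// Maps a confidence percentage (0..100) to the three bands described above.
function interpretConfidence(percent) {
  if (percent > 85) return 'reliable';
  if (percent >= 50) return 'check top-2';
  return 'unreliable';
}

console.log(interpretConfidence(92)); // reliable
console.log(interpretConfidence(64)); // check top-2
console.log(interpretConfidence(31)); // unreliable
```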

Language Detection - free

What language is this text?

Paste any block of text and this tool tells you what language it is, with a confidence score and the top five matching candidates. It uses franc-min, a small Node library that recognises over 80 languages through a pure statistical method: it splits text into three-letter chunks called trigrams, counts how often each one appears, and compares those frequencies against reference profiles built from real language samples.

Everything runs on our server in plain JavaScript. No machine-learning model, no external API, no data leaves our infrastructure beyond the request itself. We do not store the text you submit.

Two important things to know up front. Short input fails: under twenty characters the trigram statistics are basically noise, so the answer can flip language with one extra word. And closely related languages confuse the detector: Czech and Slovak share so many trigrams that a short Czech sentence sometimes scores higher for Slovak. Always look at the top-5 list before treating the headline result as gospel.

How to use it

Paste your text in the input box. Anything counts: an email, a paragraph, a chat message, a tweet.
Try the sample chips under the box if you want to see how detection behaves on English, Polish, German, Japanese and Arabic.
Click "Detect language". The result comes back in under a hundred milliseconds because nothing leaves our server.
Read the primary verdict: the detected language name, its flag, the ISO 639-3 three-letter code and the ISO 639-1 two-letter code (when one exists).
Glance at the confidence percentage: anything above 85% is solid, 50-85% means the input is short or shares trigrams with another language, below 50% means the result is unreliable.
Open the top-5 candidates below. If the second candidate is within a few percent of the first, your text might be a mix or one of the famous "look-alike" pairs (Czech / Slovak, Norwegian / Danish, Spanish / Portuguese).
For mixed-language text (an English email with one Polish quote, for example) expect the detector to pick the dominant language, not to split the result.

When this is useful

Five honest, day-to-day uses for a quick language detector:

Triage incoming support emails or contact-form messages before routing them. Drop the body in, see if it is English, Polish, German, etc., then forward to the right team. Faster than guessing from a name or a domain.
Audit a content database before running translation jobs. Paste a sample row in, confirm the language matches what the column says it should be. Catches mis-tagged rows that would otherwise be sent to the wrong translator.
Quickly identify a snippet you found in logs, in an old document, in a screenshot OCR result, when you have no idea what language it is. Detection plus the flag is usually enough to know where to look next.
Sanity-check generated content when an LLM is supposed to reply in a specific language and you suspect it answered in English by mistake. Paste, see the iso3 code, done.
Teach how trigram detection works. The top-5 list with bars is a great visual aid because you can see *how close* Czech is to Slovak or Portuguese is to Spanish in trigram space.
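
For the database audit case, the manual paste-and-compare step can be sketched as a loop. The row shape (`id`, `lang`, `text`), the function name `findMistaggedRows` and the 30-character cutoff are assumptions for this example; `detect` is any function that takes text and returns an ISO 639-3 code, which is the shape we understand franc-min's `franc` export to have.

```javascript
// rows: [{ id, lang, text }] where lang is the expected ISO 639-3 code (assumed shape).
// detect: (text) => ISO 639-3 code; supplied by the caller.
function findMistaggedRows(rows, detect, minLength = 30) {
  const mismatches = [];
  for (const row of rows) {
    // Skip rows that are too short to judge reliably.
    if (Array.from(row.text).length < minLength) continue;
    const detected = detect(row.text);
    if (detected !== row.lang) {
      mismatches.push({ id: row.id, expected: row.lang, detected });
    }
  }
  return mismatches;
}

// Stand-in detector so the example runs on its own; swap in a real one.
const fakeDetect = (text) => (/[ąęłńśźż]/.test(text) ? 'pol' : 'eng');

const rows = [
  { id: 1, lang: 'eng', text: 'This product description is written in plain English.' },
  { id: 2, lang: 'eng', text: 'Ten opis produktu został napisany po polsku, nie po angielsku.' },
  { id: 3, lang: 'pol', text: 'OK' },
];

console.log(findMistaggedRows(rows, fakeDetect));
// [ { id: 2, expected: 'eng', detected: 'pol' } ]
```

Treat the output as a list of rows to review by hand, not as an automatic correction, for the reasons given about short input and look-alike languages.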

Questions and answers

What is a trigram, and how does the detection work?

A trigram is a three-letter sequence, like "the", "ion", "ing" in English or "nie", "cie", "tej" in Polish. Every language has a characteristic frequency table: in English "the" and "and" are absurdly common, in Polish "nie" and "cie" stand out, in German the trigram "sch" hits much harder than anywhere else. franc-min ships precomputed reference profiles for each supported language. When you paste text, the library extracts your trigrams, counts them, and measures the distance between your distribution and each language profile. The smallest distance wins. No machine learning, no neural network, no training step on our side: the reference data is part of the library.
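
The extract, count and compare steps can be sketched as a toy detector. This is not franc-min's code: the profiles are built from one sentence each instead of real corpora, the distance is a simple rank-based "out-of-place" measure, and every name here (`trigramCounts`, `rankedProfile`, `distance`, `detect`, `MAX_PENALTY`) is made up for the example.

```javascript
const MAX_PENALTY = 300; // cost for a trigram missing from a profile (illustrative value)

// Lowercase, keep letters only, pad with spaces, count sliding trigrams.
function trigramCounts(text) {
  const clean = text.toLowerCase().replace(/[^\p{L}]+/gu, ' ').trim();
  const chars = Array.from(' ' + clean + ' ');
  const counts = new Map();
  for (let i = 0; i + 3 <= chars.length; i++) {
    const gram = chars.slice(i, i + 3).join('');
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Trigrams ordered from most to least frequent (ties broken alphabetically).
function rankedProfile(text, size = 300) {
  return [...trigramCounts(text).entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, size)
    .map(([gram]) => gram);
}

// Sum of rank differences; unseen trigrams cost MAX_PENALTY each.
function distance(inputProfile, languageProfile) {
  let total = 0;
  inputProfile.forEach((gram, rank) => {
    const other = languageProfile.indexOf(gram);
    total += other === -1 ? MAX_PENALTY : Math.abs(rank - other);
  });
  return total;
}

// Returns [code, distance] pairs, smallest distance (best match) first.
function detect(text, profiles) {
  const input = rankedProfile(text);
  return Object.entries(profiles)
    .map(([code, profile]) => [code, distance(input, profile)])
    .sort((a, b) => a[1] - b[1]);
}

// Toy reference profiles; real ones are built from large text samples.
const profiles = {
  eng: rankedProfile('the quick brown fox jumps over the lazy dog and the cat sits on the mat'),
  pol: rankedProfile('nie wiem czy to jest dobry pomysł ale nie mam nic innego do zrobienia'),
};

console.log(detect('the dog and the cat', profiles)[0][0]);          // eng
console.log(detect('nie wiem czy to dobry pomysł', profiles)[0][0]); // pol
```

With one-sentence profiles this only works on inputs that reuse the same words, which is exactly why real reference profiles need a lot of sample text behind them.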

Related tools

AI Text Detector

Word and Character Counter

Case Converter

Mail Header Analyzer

LLM token counter
