What Happens When You Train a Toxicity Classifier on English Data and Ask It About Hindi and Hinglish?

The model flagged the sentence as toxic. The sentence was a compliment. In Hindi, delivered the way it was written — casual, code-switched, deeply normal for the 600 million people who communicate this way — it reads as warm. The classifier did not know this. The classifier had never been asked to know this.

This is the problem.

The Benchmark Problem

Most toxicity classifiers are trained on English datasets. Most evaluation benchmarks measure performance on English text. This makes sense: English is the dominant language of the internet, of ML research, of the labs building these systems.

It also means that when the same systems are deployed in contexts where Hindi, Hinglish, or code-switched South Asian languages are primary, they carry English-tuned intuitions about what constitutes offense — and those intuitions are wrong in specific, systematic ways.

The model had never been asked to understand code-switching. It had been asked to classify toxicity. It was doing its best with the only world it knew.

What We Found

While quantifying bias in Hindi and Hinglish toxicity classification, we found that the errors were not random noise. Certain categories of miscalibration appeared consistently:

False positives on code-switching: Sentences mixing Hindi and English at the lexical level were flagged more often than equivalent English sentences with identical semantic content. The model interpreted unfamiliarity as suspicion.

Script sensitivity: Equivalent content written in Devanagari script and in romanized Hindi (the Latin-script form in which Hinglish is usually typed) produced different classification outputs. The model had no way to know that the two scripts encoded the same language.

Cultural context collapse: Idioms, honorifics, and forms of address that carry specific meaning in Hindi-speaking contexts were flattened by classifiers trained to read surface-level aggression markers.
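
One way to make the first two patterns measurable is to score groups of sentences that mean the same thing but differ in language mix or script, then count how often the decision changes. The sketch below is a minimal illustration, not our evaluation code: the function names, the example sentences, and especially `toy_predict` (a deliberately bad stand-in that flags any non-ASCII text) are assumptions made up for this example. In practice `predict` would wrap the classifier under test.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Each group holds meaning-equivalent variants of one sentence,
# keyed by a variant name such as "english" or "devanagari".
VariantGroup = Dict[str, str]


def flag_rate_by_variant(
    groups: List[VariantGroup], predict: Callable[[str], bool]
) -> Dict[str, float]:
    """Share of sentences flagged as toxic, computed per variant name."""
    counts: Dict[str, int] = defaultdict(int)
    flagged: Dict[str, int] = defaultdict(int)
    for variants in groups:
        for name, text in variants.items():
            counts[name] += 1
            flagged[name] += int(predict(text))
    return {name: flagged[name] / counts[name] for name in counts}


def variant_disagreement_rate(
    groups: List[VariantGroup], predict: Callable[[str], bool]
) -> float:
    """Share of groups where at least two variants get different decisions."""
    if not groups:
        raise ValueError("need at least one group of variants")
    disagreements = 0
    for variants in groups:
        decisions = {predict(text) for text in variants.values()}
        if len(decisions) > 1:
            disagreements += 1
    return disagreements / len(groups)


# Hypothetical stand-in classifier: flags any text containing non-ASCII
# characters. It exists only to make the example runnable.
def toy_predict(text: str) -> bool:
    return not text.isascii()


# Illustrative, benign sentences; none of these should be flagged.
groups = [
    {
        "english": "You are a great friend",
        "devanagari": "तुम बहुत अच्छे दोस्त हो",
        "romanized": "tum bahut achhe dost ho",
    },
    {
        "english": "See you tomorrow",
        "devanagari": "कल मिलते हैं",
        "romanized": "kal milte hain",
    },
]

print(flag_rate_by_variant(groups, toy_predict))
print(variant_disagreement_rate(groups, toy_predict))
```

Because every sentence here is benign, any nonzero flag rate is a false positive rate, and any disagreement within a group is a decision driven by form rather than meaning.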

Why This Matters Beyond Research

Content moderation is not neutral infrastructure. When a classifier systematically misreads the communication patterns of a linguistic community, the community bears the cost: false removal of legitimate content, under-enforcement against actual harm that reads as benign in the dominant language’s frame, and a general sense that the platform was not built for you.

This is a fairness problem. It is also an engineering problem. It has tractable solutions. It mostly gets skipped because the benchmark numbers look fine on English.

What “Looks Fine” Actually Means

A classifier that achieves 92% accuracy on a standard English benchmark and 74% accuracy on Hinglish text is not a 92% accurate system. It is a system with 92% accuracy on English and 74% accuracy on Hinglish that is reported as the first number and deployed everywhere.

The gap is not noise. The gap is whose experience was considered during evaluation.
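
To make the arithmetic concrete, here is a minimal sketch of reporting accuracy per language slice and then blending the slices by how much traffic each one receives. The function names and the 60/40 traffic split are assumptions chosen for illustration; only the 92% and 74% figures come from the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(
    records: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """Accuracy per slice from (slice_name, true_label, predicted_label) rows."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for slice_name, label, pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(label == pred)
    return {name: correct[name] / total[name] for name in total}


def traffic_weighted_accuracy(
    per_slice: Dict[str, float], traffic_share: Dict[str, float]
) -> float:
    """Blend per-slice accuracies using each slice's share of real traffic."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(per_slice[name] * share for name, share in traffic_share.items())


# Per-slice numbers from the example in the text.
per_slice = {"english": 0.92, "hinglish": 0.74}

# Hypothetical deployment mix, assumed only for this illustration.
traffic_share = {"english": 0.6, "hinglish": 0.4}

blended = traffic_weighted_accuracy(per_slice, traffic_share)
print(round(blended, 3))  # 0.848
```

Under that assumed mix the system users actually experience is about 85% accurate, not 92%, and the blended number still hides the slice where it performs worst. Reporting the per-slice table alongside any headline figure avoids both problems.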

What Comes Next

Better multilingual training data. Targeted evaluation benchmarks that include code-switched text. Classifiers evaluated on the distributions they will actually encounter in deployment, not the distributions that make the paper numbers clean.
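
A small first step toward a targeted benchmark is checking what an evaluation set actually contains. The sketch below buckets texts by script so that gaps in coverage become visible. It is an assumption-laden illustration: the function names are invented here, only the main Devanagari Unicode block (U+0900 to U+097F) is checked, and script alone cannot separate romanized Hindi from English, which would need language identification or human annotation.

```python
from collections import Counter
from typing import Iterable


def script_bucket(text: str) -> str:
    """Classify text as 'devanagari', 'latin', 'mixed', or 'other'."""
    # Main Devanagari block only; extended blocks are ignored in this sketch.
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


def script_coverage(texts: Iterable[str]) -> Counter:
    """Count how many evaluation texts fall into each script bucket."""
    return Counter(script_bucket(t) for t in texts)


# Illustrative samples, not drawn from any real benchmark.
samples = [
    "See you tomorrow",
    "kal milte hain",
    "कल मिलते हैं",
    "yaar यह movie अच्छी thi",
]
print(script_coverage(samples))
```

An evaluation set whose counts are almost entirely in the "latin" bucket, with no annotation separating English from romanized Hindi, cannot say anything about the deployment conditions described above.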

None of this is novel as a research direction. Most of it exists as recommendations in papers that describe the problem. The distance between “here is the problem” and “we fixed the problem” in ML research is sometimes a single lab decision and sometimes an entire field’s priorities.

We are working on the distance.