Building HiSLM: What Happens When You Try to Train a Language Model and Learn Everything the Hard Way
The idea was to build a Hindi Sentiment and Language Model — HiSLM — that could handle the specific way people actually write Hindi online: code-switched, romanized, emotionally inflected, and built out of a vocabulary that standard NLP pipelines handle badly.
The idea was correct. The execution took considerably longer than the idea.
Why Hindi-Specific, Why Now
There are about 600 million Hindi speakers. The fraction of NLP research conducted on Hindi, Hinglish, or any South Asian language is much smaller than that number suggests it should be.
The gap is not neutral. It means the models deployed to moderate, translate, and understand content from Hindi-speaking users are operating on assumptions baked in from English training data. Those assumptions fail in specific, reproducible ways that are well documented in the literature and not yet fixed in the systems that matter.
Building a Hindi-first model is not a curiosity project. It is filling in something that is missing.
The First Three Weeks
The first three weeks were mostly finding data and arguing with the data about what it was. Hindi text on the internet is not clean. It is code-switched, typo-heavy, script-inconsistent, and emotionally complex in ways that English sentiment datasets do not capture.
Standard preprocessing pipelines eat it and produce garbage. I spent approximately thirty hours on tokenization before the tokenizer stopped treating “नहीं” and “nahi” as unrelated tokens for the same word.
This is the research equivalent of discovering that your ruler is measuring in different units depending on the day.
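To make the “नहीं” versus “nahi” problem concrete, here is a minimal sketch of one way to fold romanized spellings into a single canonical form before tokenization. It is not the actual HiSLM tokenizer. The lexicon `ROMAN_TO_DEVANAGARI` and the function names are hypothetical, and a real system would need a far larger mapping or a learned transliteration step.

```python
import re
import unicodedata
from typing import List

# Hypothetical, hand-built lexicon: romanized spellings mapped to one
# canonical Devanagari form. A real mapping would be much larger.
ROMAN_TO_DEVANAGARI = {
    "nahi": "नहीं",
    "nahin": "नहीं",
    "nhi": "नहीं",
    "haan": "हाँ",
}


def normalize_token(token: str) -> str:
    """Map a token to a canonical form so script variants compare equal."""
    # NFC makes visually identical Devanagari strings byte-identical.
    token = unicodedata.normalize("NFC", token)
    key = token.lower()
    # Collapse runs of three or more repeated characters ("nahiii" -> "nahi").
    key = re.sub(r"(.)\1{2,}", r"\1", key)
    # Fall back to the (NFC-normalized) token when the lexicon has no entry.
    return ROMAN_TO_DEVANAGARI.get(key, token)


def normalize_text(text: str) -> List[str]:
    """Whitespace-split the text and normalize each token."""
    return [normalize_token(t) for t in text.split()]
```

With this in place, `normalize_token("nahi")` and `normalize_token("नहीं")` return the same string, so whatever tokenizer runs afterwards sees one word instead of two.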
The Training Infrastructure Problem
I do not have a GPU cluster. I have a laptop with a 4GB GPU that runs warm enough to use as a heating pad in December. Google Colab gives you time-limited access to something better if you are patient and the session doesn’t time out.
Most serious ML research happens with resources I don’t have. The work I can do is shaped by the constraints I’m working in. This is true for most people who are not at large research labs. It is worth saying directly: the published results in NLP research are disproportionately produced by groups with access to compute that most students will never touch.
I am working with what I have. What I have is enough to make progress. It is not enough to match the benchmarks in the papers I am comparing against, and that is fine as long as I am honest about it.
What The Model Can Do
As of the latest checkpoint: sentiment classification on Hinglish text, with accuracy meaningfully above baseline multilingual models on the evaluation set I constructed. The evaluation set is small. The result is real but not final.
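Because the evaluation set is small, one way to be honest about a gap over a baseline is to report an interval rather than a single number. The sketch below uses a percentile bootstrap over evaluation examples. It is an illustration under assumptions, not the evaluation code behind the result above: the function name, the number of resamples, and the 95% level are all choices made for the example.

```python
import random


def bootstrap_accuracy_gap(model_preds, baseline_preds, labels,
                           n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for (model acc - baseline acc).

    All three inputs are equal-length sequences over the same examples.
    """
    rng = random.Random(seed)
    n = len(labels)
    gaps = []
    for _ in range(n_resamples):
        # Resample example indices with replacement; both systems are
        # scored on the same resample so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        model_acc = sum(model_preds[i] == labels[i] for i in idx) / n
        base_acc = sum(baseline_preds[i] == labels[i] for i in idx) / n
        gaps.append(model_acc - base_acc)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper
```

If the lower end of the interval sits above zero, the improvement is less likely to be an artifact of which few hundred examples happened to land in the evaluation set. A wide interval is itself a useful thing to report.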
The next phase is a larger evaluation corpus, adversarial examples (text designed to expose the failure modes the current architecture hasn’t seen), and a paper draft that is honest about what the model gets wrong as well as what it gets right.
What I Have Learned That Is Not On The Resume
Training a model from scratch teaches you things that fine-tuning does not. You see exactly what “training dynamics” means when your own training falls apart at step 4000. You develop opinions about learning rate schedulers that are hard to acquire from reading papers about them.
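For readers who have not yet had to form opinions about schedulers, here is a minimal sketch of one common shape: linear warmup followed by cosine decay. This is not necessarily the schedule HiSLM uses, and every number in it (`peak_lr`, `warmup_steps`, `total_steps`, `min_lr`) is a placeholder chosen for illustration.

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=1000,
               total_steps=20000, min_lr=3e-5):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine factor goes from 1 at the start of decay to 0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

Printing `lr_at_step` at a few steps around the point where a run diverges is a cheap first check of whether the learning rate at that moment is higher than expected.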
More than anything: research is mostly failure that is building toward something. The checkpoint that works is the result of the checkpoints that didn’t. None of the failed checkpoints appear in the published results. This is probably fine for the paper and probably misleading for everyone who reads the paper expecting a straight line.
The straight line is the presentation. The work is the zigzag.