Why I Stopped Using PyTorch Lightning
For two years I reached for PyTorch Lightning on every project. It promised to remove boilerplate, and it did — right up until the day the boilerplate was the only thing standing between me and a bug I could not see.
The abstraction tax
The trouble with a framework that owns your training loop is that it owns your training loop. When the loss curve goes flat at 3am, you are no longer debugging your code. You are debugging someone else’s idea of how your code should run.
The training loop is not boilerplate. It is the experiment.
A plain loop is six lines you can read top to bottom:
    for epoch in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss = model(batch).loss
            loss.backward()
            opt.step()
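For anyone who wants to run that loop as-is, here is a minimal self-contained sketch. Everything outside the loop body is an assumption made for illustration: the toy regression data, the nn.Linear model, the SGD learning rate, and the train helper itself. It also swaps the snippet's model(batch).loss for an explicit loss function, since the post does not say which model interface is in play.

```python
import torch
from torch import nn


def train(model, loader, opt, loss_fn, epochs):
    """Plain training loop; returns the mean loss of each epoch."""
    history = []
    for epoch in range(epochs):
        total, count = 0.0, 0
        for inputs, targets in loader:
            opt.zero_grad()                         # clear gradients from the previous step
            loss = loss_fn(model(inputs), targets)  # forward pass
            loss.backward()                         # backward pass
            opt.step()                              # parameter update
            total += loss.item()
            count += 1
        history.append(total / count)
    return history


torch.manual_seed(0)

# Hypothetical toy data: y = 3x + 1, with no noise.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1

# A list of (inputs, targets) batches stands in for a DataLoader here.
loader = [(x[i:i + 16], y[i:i + 16]) for i in range(0, 64, 16)]

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
history = train(model, loader, opt, nn.MSELoss(), epochs=20)
print(f"first epoch loss {history[0]:.4f}, last epoch loss {history[-1]:.4f}")
```

Because each step is an ordinary statement, a print or a breakpoint can go between any two of them when the loss curve goes flat.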
What I switched to
Nothing exotic. Raw PyTorch, a thin train.py, and a single config file. The result is more lines and far fewer surprises, which, when you are the only person paged about the run, is the trade you want.
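The "single config file" half of that setup could look something like the sketch below. It is an assumption, not the actual layout: the post does not name a config format or any fields, so the JSON format, the TrainConfig name, the load_config helper, and every field and default are hypothetical. It uses only the standard library.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical fields and defaults, chosen only for illustration.
    epochs: int = 10
    batch_size: int = 32
    lr: float = 1e-3
    seed: int = 0


def load_config(text: str) -> TrainConfig:
    """Parse JSON text into a TrainConfig, rejecting unknown keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(TrainConfig)}
    unknown = set(raw) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return TrainConfig(**raw)


# In a real train.py the text would come from a file on disk.
cfg = load_config('{"epochs": 3, "lr": 0.01}')
print(asdict(cfg))
```

Rejecting unknown keys is a deliberate choice here: a misspelled field raises an error at startup instead of quietly training with a default.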