How I work
A model that scores well for the wrong reason is worse than one that scores honestly, because it fails silently in production. So before I trust a result, I try to break it.
On a UAV object-classification project, a naïve random train/test split gave a confidently high macro-F1. But consecutive frames of the same vehicle are near-duplicates, so a random split put near-copies of the test frames into the training set. A track-aware split, which keeps every frame of a given object on one side of the divide, dropped macro-F1 by roughly 0.5. That collapse is the real story: it's the difference between a number you can put in a report and one that would embarrass you in deployment.
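To make the idea concrete, here is a minimal sketch of a track-aware split. It is an illustration, not the project's actual code: the function name `track_aware_split`, the `test_fraction` parameter, and the toy track IDs are all assumptions. The only point is that whole tracks, not individual frames, are assigned to a side.

```python
import random
from collections import defaultdict


def track_aware_split(track_ids, test_fraction=0.2, seed=0):
    """Split frame indices so every track lands entirely in train or in test.

    track_ids[i] is the (hypothetical) track label of frame i.
    """
    # Group frame indices by the track they belong to.
    frames_by_track = defaultdict(list)
    for idx, track in enumerate(track_ids):
        frames_by_track[track].append(idx)

    # Shuffle the tracks, not the frames. Sorting first keeps the result
    # deterministic for a given seed (assumes track labels are sortable).
    tracks = sorted(frames_by_track)
    random.Random(seed).shuffle(tracks)

    # Hold out a fraction of the tracks, at least one.
    n_test = max(1, round(len(tracks) * test_fraction))
    test_idx = [i for t in tracks[:n_test] for i in frames_by_track[t]]
    train_idx = [i for t in tracks[n_test:] for i in frames_by_track[t]]
    return train_idx, test_idx


# Toy data for illustration: 5 tracks with 4 near-duplicate frames each.
track_ids = [f"track_{k}" for k in range(5) for _ in range(4)]
train_idx, test_idx = track_aware_split(track_ids, test_fraction=0.2, seed=0)

# No track appears on both sides of the split.
train_tracks = {track_ids[i] for i in train_idx}
test_tracks = {track_ids[i] for i in test_idx}
print(train_tracks & test_tracks)
```

Because the fraction applies to tracks rather than frames, the share of frames held out can drift from `test_fraction` when track lengths vary a lot. For real pipelines, a library group splitter such as scikit-learn's `GroupShuffleSplit` does the same job.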
The same discipline runs through everything here: strong baselines before big models (a tuned GRU beating a LoRA-fine-tuned GPT-2 on the Cornell Movie-Dialogs Corpus), the right metric for the data, and reproducibility: fixed seeds, logged configs, and runs tracked in Weights & Biases, so every figure on this site can be regenerated.
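As a rough sketch of the "fixed seeds, logged configs" habit, using only the standard library: the helper name `start_run` and the example settings are hypothetical. A real training run would also need to seed NumPy and the deep-learning framework, and would send the record to the experiment tracker instead of keeping it in memory.

```python
import hashlib
import json
import random


def start_run(config, seed):
    """Seed the RNG and return a record that identifies this run's settings."""
    random.seed(seed)
    record = {"seed": seed, "config": config}
    # Canonical JSON (sorted keys) so identical settings always hash the same,
    # regardless of the order the keys were written in.
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    record["config_hash"] = hashlib.sha256(blob).hexdigest()[:12]
    return record


# Hypothetical settings, for illustration only.
run_a = start_run({"model": "gru", "hidden_size": 256}, seed=13)
draws_a = [random.random() for _ in range(3)]

run_b = start_run({"hidden_size": 256, "model": "gru"}, seed=13)
draws_b = [random.random() for _ in range(3)]

# Same seed and same settings give the same random draws and the same hash.
print(draws_a == draws_b, run_a["config_hash"] == run_b["config_hash"])
```

Storing the short hash next to each reported figure is one way to tie a number back to the exact settings that produced it.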
