← Back to home
Writing

Notes on honest evaluation

Short write-ups on the part of machine learning that doesn't fit in a leaderboard: leakage, baselines, and why a good number is usually wrong until proven otherwise.

UAV confusion matrix
Evaluation · 6 min read

How a naïve train/test split inflated my UAV F1 by 0.5

The same model, the same data, and a macro-F1 that fell from 0.78 to 0.26 the moment I split the data honestly. A walk through the most expensive bug in ML.

12 Jun 2026 Read →
NLP metric comparison
Baselines · 7 min read

A tuned GRU beat LoRA-fine-tuned GPT-2, here's why

A 117M-parameter transformer lost to a small recurrent net on every metric. Not because transformers are bad, but because the baseline was done properly.

20 Jun 2026 Read →
×