Baselines · 7 min read · 20 Jun 2026

A tuned GRU beat LoRA-fine-tuned GPT-2: here's why

A 117-million-parameter transformer lost to a small recurrent network on every metric I measured. The headline is tempting but wrong. The real lesson is quieter: most "transformers win" results are really "the baseline was never tuned" results.

The setup

The task was open-domain dialogue generation on the Cornell Movie-Dialogs Corpus, using 53,107 conversational pairs. I built three models: a GRU-based sequence-to-sequence model, an LSTM variant, and a GPT-2 fine-tuned with LoRA. The expectation, going in, was the obvious one: the pretrained transformer should walk it.

What actually happened

The well-tuned GRU won on every metric, most starkly on perplexity, where lower is better.

Model                  Perplexity (↓)   Relative
GRU seq2seq (tuned)    12.39            best
GPT-2 + LoRA           45.62            3.7× higher

The GRU's perplexity was 3.7× lower than the LoRA-fine-tuned GPT-2's, and its BLEU was roughly 2.8× higher. A model with a fraction of the parameters produced more coherent, more on-distribution replies.
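For readers who want to check the arithmetic, here is a minimal sketch of how a perplexity figure and the "3.7× higher" ratio can be derived. It assumes perplexity is computed the usual way, as the exponential of the mean per-token negative log-likelihood; the post does not spell out the exact evaluation code, and the function names `perplexity` and `relative_perplexity` are hypothetical helpers for illustration only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each reference token (assumed input format, not the author's code).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_perplexity(ppl, best_ppl):
    """How many times higher `ppl` is than the best (lowest) perplexity."""
    return ppl / best_ppl


# Sanity check: a model that gives every token probability 0.5 has perplexity 2.
toy_ppl = perplexity([math.log(0.5)] * 4)

# Perplexities reported in the table above.
gru_ppl = 12.39
gpt2_lora_ppl = 45.62

ratio = relative_perplexity(gpt2_lora_ppl, gru_ppl)
# Gap in mean per-token loss (nats) implied by the two perplexities.
nll_gap = math.log(gpt2_lora_ppl) - math.log(gru_ppl)

print(f"GPT-2 + LoRA perplexity is {ratio:.1f}x the GRU's")
print(f"Implied gap in mean per-token loss: {nll_gap:.2f} nats")
```

Read this way, the 3.7× ratio corresponds to a gap of about 1.3 nats in average per-token loss between the two models.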

[Figure: Dialogue model metric comparison. Side-by-side metrics across the three models.]

Why the small model won

Three reasons, none of which are "GRUs are secretly better than transformers."

The real lesson

This is not an argument against transformers. On a large, diverse corpus, or judged by human-rated coherence rather than corpus perplexity, GPT-2 might well pull ahead. The honest takeaway is about process:

The most useful thing the GRU did wasn't winning. It was forcing me to explain why it won, and that explanation is worth more than the leaderboard row.

Read the full dialogue case study →