Natural Language Processing · seq2seq · LoRA

Dialogue Generation: GRU vs LSTM vs GPT-2

Does a fine-tuned transformer always beat a recurrent baseline? On a modest dialogue dataset under limited compute, a well-tuned GRU won on every metric.

PyTorch · Hugging Face Transformers · PEFT / LoRA · seq2seq + attention · 2026
53,107 dialogue pairs
12.39 GRU perplexity (best)
3.7× lower perplexity than GPT-2
2.8× higher BLEU than GPT-2

The problem

Open-domain dialogue response generation on the Cornell Movie-Dialogs Corpus — 53,107 utterance pairs (47,797 train / 5,310 test), vocabulary 7,822, max length 10 tokens. The question: with limited data and a single GPU, does parameter-efficient fine-tuning of a large pretrained model beat a purpose-built small one?
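The preprocessing code is not shown here, so the following is only a sketch of one way to arrive at a dataset of this shape. It assumes whitespace tokenization, a 10-token cap applied to both sides of a pair, and a shuffled hold-out of floor(10%) of the pairs; `keep_pair` and `split_pairs` are hypothetical helper names, and the pairs are placeholder data. Under those assumptions the split sizes match the counts quoted above.

```python
import random

MAX_LEN = 10  # maximum tokens per utterance, as stated in the text


def keep_pair(prompt, reply, max_len=MAX_LEN):
    """Return True if both utterances fit within max_len whitespace tokens.

    Whitespace tokenization is an assumption; the original tokenizer is unknown.
    """
    return len(prompt.split()) <= max_len and len(reply.split()) <= max_len


def split_pairs(pairs, test_fraction=0.1, seed=0):
    """Shuffle the pairs and hold out floor(test_fraction * n) of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data standing in for the 53,107 corpus pairs.
pairs = [(f"line {i}", f"reply {i}") for i in range(53_107)]
pairs = [p for p in pairs if keep_pair(*p)]

train, test = split_pairs(pairs)
print(len(train), len(test))  # 47797 5310
```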

Approach

Results

| Model | Perplexity ↓ | BLEU ↑ | Embed similarity ↑ |
|---|---|---|---|
| GRU seq2seq | 12.39 | 0.0335 | 0.402 |
| LSTM seq2seq | 13.80 | 0.0305 | 0.349 |
| GPT-2 (LoRA) | 45.62 | 0.0119 | 0.229 |

The GRU swept every metric: 3.7× lower perplexity and 2.8× higher BLEU than the LoRA-tuned GPT-2, with far fewer trainable parameters. The GRU also edged the LSTM, reaching a lower final training loss (2.36 vs 2.60).
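The headline ratios follow directly from the reported numbers, and a short check makes the arithmetic explicit. The snippet below recomputes them and confirms that the GRU leads on each metric. The last part assumes the training loss is a mean per-token cross-entropy in nats, in which case perplexity is exp(loss); that assumption is not confirmed by the write-up, so the resulting training perplexities are indicative only.

```python
import math

# Reported results, copied from the results table.
results = {
    "GRU seq2seq": {"perplexity": 12.39, "bleu": 0.0335, "embed_sim": 0.402},
    "LSTM seq2seq": {"perplexity": 13.80, "bleu": 0.0305, "embed_sim": 0.349},
    "GPT-2 (LoRA)": {"perplexity": 45.62, "bleu": 0.0119, "embed_sim": 0.229},
}

gru, gpt2 = results["GRU seq2seq"], results["GPT-2 (LoRA)"]
ppl_ratio = gpt2["perplexity"] / gru["perplexity"]  # how many times lower the GRU is
bleu_ratio = gru["bleu"] / gpt2["bleu"]             # how many times higher the GRU is
print(f"{ppl_ratio:.1f}x lower perplexity, {bleu_ratio:.1f}x higher BLEU")

# Best model per metric: lowest perplexity, highest BLEU and similarity.
best = {
    "perplexity": min(results, key=lambda m: results[m]["perplexity"]),
    "bleu": max(results, key=lambda m: results[m]["bleu"]),
    "embed_sim": max(results, key=lambda m: results[m]["embed_sim"]),
}
print(best)

# Assumption: loss is mean per-token cross-entropy in nats, so perplexity = exp(loss).
train_ppl = {name: math.exp(loss) for name, loss in [("GRU", 2.36), ("LSTM", 2.60)]}
print({name: round(p, 1) for name, p in train_ppl.items()})
```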

Figure: Metric comparison across the three models (BLEU, embedding similarity, perplexity).
Figure: Training-loss curves, GRU vs LSTM.
Figure: Side-by-side comparison of all models.
Figure: Per-response score distributions.

Why the small model won

LoRA freezes most of GPT-2 and trains a small adapter — powerful when the base model already covers your domain, but movie dialogue is short, idiosyncratic and far from GPT-2's web-text prior. A from-scratch encoder–decoder, trained end-to-end on exactly this distribution, fit it better. Bigger and pretrained is not automatically better when the data is narrow and the compute is fixed.
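To give a sense of how small a LoRA adapter is relative to the frozen base, here is a back-of-the-envelope parameter count. Every setting in it is an assumption, since the write-up does not state the configuration used: the smallest GPT-2 checkpoint (12 layers, hidden size 768, about 124M parameters), rank 8, and adapters only on the fused attention projection (768 → 2304) in each layer. For a wrapped linear layer, LoRA adds two matrices, A of shape (r, d_in) and B of shape (d_out, r), and only these are trained.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * d_in + d_out * rank


# Assumed configuration (not stated in the write-up).
N_LAYERS = 12              # transformer blocks in the smallest GPT-2
HIDDEN = 768               # hidden size of the smallest GPT-2
RANK = 8                   # hypothetical LoRA rank
BASE_PARAMS = 124_439_808  # commonly reported size of the smallest GPT-2

# One adapter per layer on the fused query/key/value projection (768 -> 3 * 768).
trainable = N_LAYERS * lora_params(HIDDEN, 3 * HIDDEN, RANK)
total = BASE_PARAMS + trainable

print(f"trainable: {trainable:,}")                # trainable: 294,912
print(f"share of total: {trainable / total:.2%}")  # share of total: 0.24%
```

Under these assumptions well under 1% of the weights receive gradient updates, which is why the adapter can only nudge the pretrained model toward a new distribution rather than refit it.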

What I took away

