Train, Validation, Test: Why Three Splits Not Two
A dataset that your decisions never touch is the only honest judge of how your model will perform. Here is why two splits are not enough to get that judge.
The quiet problem with two splits
Most people learn evaluation the same way: split your data into train and test, fit on train, score on test, done. It feels complete. You never touched the test data during training, so surely the score is honest. The trouble is that fitting a model is not the only decision you make. You also choose a learning rate, a regularisation strength, the depth of a tree, the number of layers, which features to keep. Every one of those choices needs to be evaluated against something, and if that something is your test set, the test set stops being untouched.
This is the part that catches people out, because it does not feel like cheating. You are not looking at test labels directly or training on test rows. You are just checking which hyperparameter setting scores best on test, then reporting that best score as your final result. But that process is a search, and a search that uses the test set to pick a winner has, by construction, tuned itself to that specific set of examples. The number you report is now optimistic, sometimes only slightly, sometimes substantially, and you have no way of knowing which without a third, genuinely untouched set.
The three-way split exists to separate two different jobs that people conflate: model selection and model assessment. Selection is the process of comparing candidates and picking a winner. Assessment is reporting how well the winner will do on new data. If you use the same data for both jobs, your assessment is contaminated by the selection process, because the winner was, in part, chosen for doing well on that exact data.
A worked example with numbers
Suppose you are building a classifier and you have 10,000 labelled examples. You try five different regularisation strengths, train each on a training split, and evaluate all five on a test split of 2,000 examples. Setting C performs best, at 91.4 percent accuracy. You report 91.4 percent as your model's expected performance.
Here is the issue: with 2,000 test examples, there is sampling noise in that accuracy figure. At around 91 percent accuracy the standard error is about 0.6 percentage points, so a swing of roughly a point in either direction can come purely from which examples happened to land in the test set. When you try five candidates and pick the best one on that same 2,000-example set, you are effectively asking, out of five noisy estimates, which one got lucky. The winner is more likely to be a setting that both performs well and happened to benefit from favourable noise on this particular test set. Report that number as if it reflects true future performance and you have baked in an upward bias, often small, sometimes not.
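The noise and the selection effect described above can be checked with a few lines of arithmetic and a small simulation. This is a minimal sketch, not a definitive analysis: the function names are hypothetical, the five candidates are assumed to share the same true accuracy of 90 percent, and their test scores are simulated as independent draws, which overstates the effect somewhat because real candidates scored on the same test set have correlated errors.

```python
import math
import random

def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

def simulate_best_of_k(true_accuracy, n_test, k, trials, seed=0):
    """Average reported score when the best of k candidates is picked on test.

    Assumption: every candidate has the same true accuracy and its test
    score is an independent binomial draw.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        scores = []
        for _ in range(k):
            correct = sum(rng.random() < true_accuracy for _ in range(n_test))
            scores.append(correct / n_test)
        total += max(scores)  # the "winner" is whichever estimate got lucky
    return total / trials

se = standard_error(0.914, 2000)
mean_best = simulate_best_of_k(0.90, n_test=2000, k=5, trials=200)
print(f"standard error at 91.4% on 2,000 examples: {se:.4f}")
print(f"true accuracy 0.9000, average best-of-5 score: {mean_best:.4f}")
```

Because all five simulated candidates are equally good, any amount by which the average best-of-five score exceeds 0.90 is pure selection bias.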
Now do it properly. Split the data into 6,000 training, 2,000 validation, and 2,000 test examples. Try your five candidates, evaluate each on validation, and pick the winner based on validation accuracy; say it scores 91.4 percent there. Then, and only then, run that single chosen model once on the test set. Suppose test accuracy comes back at 89.7 percent. That drop is not a mistake; it reflects the optimism that crept into the validation number during selection, together with ordinary sampling noise between two finite sets. The test score, touched exactly once, is your genuine estimate of future performance, because nothing about the model was chosen to please it.
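The procedure above can be sketched in plain Python. The names `three_way_split` and `select_then_assess` are hypothetical, and the two scoring callables stand in for whatever training and evaluation code you actually use; the point is only the order of operations, with every candidate scored on validation and the winner alone scored on test.

```python
import random

def three_way_split(n_examples, n_val, n_test, seed=0):
    """Shuffle example indices once and cut them into train, validation, test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

def select_then_assess(candidates, score_on_val, score_on_test):
    """Model selection on validation, then one assessment on test.

    score_on_val and score_on_test are placeholder callables that take a
    candidate and return its accuracy on the corresponding split.
    """
    val_scores = {name: score_on_val(name) for name in candidates}
    winner = max(val_scores, key=val_scores.get)
    # The test split is consulted exactly once, for the winner only.
    return winner, val_scores[winner], score_on_test(winner)

# 10,000 examples -> 6,000 train, 2,000 validation, 2,000 test
train_idx, val_idx, test_idx = three_way_split(10_000, n_val=2_000, n_test=2_000)
```

Splitting by shuffled index rather than by position guards against any ordering in the raw data; a fixed seed keeps the split reproducible across runs.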
The gap between validation and test performance is itself informative. A small gap suggests your validation set was a reasonable proxy and your selection process was not overly aggressive. A large gap suggests you tried too many candidates relative to your validation set size, or that your validation set is too small to support fine-grained comparisons, and you should treat future validation-based decisions with more scepticism.

Where this goes wrong in practice
The most common failure I see is early stopping tied to the test set. Someone trains a neural network, checks test accuracy every few epochs, and stops when it peaks. That peak epoch was chosen because it looked best on test data, which means the reported accuracy at that epoch is inflated for exactly the same reason as the hyperparameter example above. Early stopping needs a validation set, full stop, with test only touched for the final, single, post-hoc measurement.
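A minimal sketch of early stopping driven by validation scores only. The function name and the `patience` parameter are illustrative assumptions, and a precomputed list of per-epoch validation accuracies stands in for a real training loop, where the check would run after each epoch.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (best_epoch, best_score) from validation scores alone.

    Scanning stops once `patience` epochs have passed without a new best,
    mirroring what a training loop would do. Test data is never consulted.
    """
    if not val_scores:
        raise ValueError("need at least one validation score")
    best_epoch, best_score = 0, val_scores[0]
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_score

# Hypothetical validation accuracies, one per epoch
history = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.90]
stop_epoch, stop_score = early_stopping_epoch(history, patience=3)
```

Only after the stopping epoch is fixed this way would the checkpoint from that epoch be scored on test, a single time.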
Another version of this creeps in through feature engineering. If you compute normalisation statistics, encode categorical variables, or select features using information from the full dataset before splitting, information from your test set has leaked into your training pipeline even though no test labels were used for fitting the model itself. Any preprocessing step that learns something from data (a mean, a vocabulary, a set of important features) must be fit on the training split alone and then applied unchanged to validation and test.
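As an illustration of fitting preprocessing on the training split alone, here is a small sketch for standardising a single numeric feature. The helper names `fit_scaler` and `apply_scaler` and the feature values are made up for the example; the same fit-then-apply pattern would hold for a vocabulary or a feature selector.

```python
import statistics

def fit_scaler(train_values):
    """Learn mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for a constant feature
    return mean, std

def apply_scaler(values, scaler):
    """Apply previously learned statistics without refitting."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

# Hypothetical feature values for each split
train_x = [1.0, 2.0, 3.0]
val_x = [2.0, 4.0]
test_x = [0.0, 2.0]

scaler = fit_scaler(train_x)            # statistics come from train only
train_scaled = apply_scaler(train_x, scaler)
val_scaled = apply_scaler(val_x, scaler)
test_scaled = apply_scaler(test_x, scaler)
```

The validation and test values are transformed with the training mean and standard deviation, so their scaled versions need not have zero mean, and that is expected.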
A subtler issue arises with repeated experimentation over time. If you keep a fixed test set and evaluate on it every time you try a new idea over the course of a project, you are running the same selection-on-test problem across weeks or months rather than across five hyperparameters in one sitting. The test set slowly becomes a second validation set through repeated peeking, and the final number you report loses its meaning. Discipline here means deciding in advance how many times you will touch the test set, ideally once, and doing all iteration against validation.
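One way to make that discipline mechanical is to wrap the test set in an object that refuses to be evaluated more than a fixed number of times. The class below is a hypothetical sketch, not part of any library, and `score_fn` is a placeholder for your own evaluation routine.

```python
class OneShotTestSet:
    """Holds the test examples and enforces a fixed evaluation budget."""

    def __init__(self, examples, max_uses=1):
        self._examples = examples
        self._remaining = max_uses

    def evaluate(self, score_fn):
        """Run score_fn on the test examples if any budget remains."""
        if self._remaining <= 0:
            raise RuntimeError(
                "test set budget exhausted; iterate on validation instead"
            )
        self._remaining -= 1
        return score_fn(self._examples)
```

A guard like this cannot stop someone from reading the raw file, but it turns an accidental second look into a visible error rather than a silent habit.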
None of this means three splits solve every problem. With small datasets, carving out a validation set as well as a test set can leave too little data to train on, which is where k-fold cross-validation earns its keep, using multiple train and validation splits and reserving a separate held-out test set for the final check. The principle survives even when the mechanics change: keep a boundary between the data that shapes your decisions and the data that judges the outcome, and treat that boundary as something to protect rather than something you will get around to later.
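A minimal sketch of generating k-fold splits over the development data, assuming the test indices have already been set aside and are never passed in. The function name `k_fold_indices` is hypothetical, and libraries such as scikit-learn provide more complete equivalents.

```python
import random

def k_fold_indices(indices, k, seed=0):
    """Yield (train, validation) index lists for each of k folds.

    `indices` should contain development examples only; the held-out
    test indices are assumed to have been removed beforehand.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal slices
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# Example: 80 development examples, 5 folds of 16 validation examples each
for train_idx, val_idx in k_fold_indices(range(80), k=5):
    pass  # fit on train_idx, score on val_idx, average the scores across folds
```

Each development example serves as validation data exactly once, so candidate comparisons draw on all of the development data while the test set stays untouched until the final check.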
The practical takeaway
Before you write a line of training code, decide which data you will use to fit parameters, which you will use to choose between candidates, and which you will touch exactly once at the very end to report a number you will stand behind. Write that plan down. If you catch yourself checking test performance more than once, or letting a preprocessing step see the whole dataset before splitting, stop and fix the pipeline rather than the number. A slightly lower, honest score is worth far more than a slightly higher one you cannot trust, because the whole point of evaluation is to know, before deployment, how the model will actually behave.
