Machine Learning · 10 min read · 3 Jul 2026

Feature Scaling: Which Models Care and Which Do Not

Some algorithms are quietly ruined by unscaled features while others could not care less. Here is the reasoning behind the split, with worked numbers.

Why this question keeps coming up

Every so often I see a pipeline where someone has standardised every column before fitting a random forest, or skipped scaling entirely before fitting a support vector machine, and then wondered why performance looked odd. Feature scaling is one of those preprocessing steps that gets applied by habit rather than by reasoning. The habit is not wrong exactly, but it hides a genuinely useful distinction: some model families are mathematically sensitive to the scale of your inputs, and some are completely indifferent to it. Knowing which is which saves time, avoids silent bugs, and stops you from applying transformations that do nothing except add complexity to your pipeline.

The core idea is simple once you see it clearly. Models that rely on distances, dot products, or gradient based optimisation over raw feature values tend to be scale sensitive. Models that make decisions based on the order of values within a feature, rather than their magnitude relative to other features, tend to be scale invariant. That single distinction explains almost the entire landscape of when scaling matters.

This is not a purely academic point. In practice, forgetting to scale before fitting a linear model with regularisation can silently distort which features get penalised, while spending time scaling before fitting a decision tree is wasted effort that adds no value and can even obscure interpretability if you are not careful about undoing the transformation later. Getting this right is part of being rigorous about your pipeline rather than just running whatever transformer happens to be next in a tutorial.

The intuition behind scale sensitivity

Think about what a model actually does with the numbers you feed it. A k nearest neighbours classifier computes distances between points across all features simultaneously, usually using something like Euclidean distance. If one feature is measured in thousands and another in single digits, the feature with the larger numeric range will dominate the distance calculation almost entirely, regardless of whether it is actually more informative for the task.

Consider a small example with two features: annual income in pounds, ranging roughly from twenty thousand to one hundred thousand, and number of years of employment, ranging from zero to forty. Suppose two customers differ by fifteen thousand pounds in income and by two years in employment. The squared difference in income contributes two hundred and twenty five million to the squared distance, while the squared difference in employment years contributes four. The income feature completely swamps the employment feature, not because it is more predictive, but purely because of the units it happens to be recorded in. If you switched income to be measured in thousands of pounds instead, the difference becomes fifteen and its contribution drops to two hundred and twenty five, still the larger term but now within two orders of magnitude of the other rather than nearly eight. The underlying information has not changed, but the model's behaviour has changed dramatically simply because of a unit choice.
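The arithmetic is easy to check directly. The snippet below is a minimal sketch using the two differences from the example; the function name is made up for illustration.

```python
def squared_contributions(income_diff, years_diff):
    """Return each feature's contribution to the squared Euclidean distance."""
    return income_diff * income_diff, years_diff * years_diff

# Differences between the two hypothetical customers.
income_diff_pounds = 15_000
years_diff = 2

in_pounds = squared_contributions(income_diff_pounds, years_diff)
in_thousands = squared_contributions(income_diff_pounds / 1000, years_diff)

print(in_pounds)     # (225000000, 4)
print(in_thousands)  # (225.0, 4)

# Share of the squared distance that comes from income under each unit choice.
print(in_pounds[0] / sum(in_pounds))
print(in_thousands[0] / sum(in_thousands))
```

Changing the unit changes how much each feature counts, even though the two customers are exactly as far apart in the real world as before.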

This same logic applies to any algorithm that computes distances or inner products across features: k nearest neighbours, k means clustering, support vector machines with common kernels, and principal component analysis all fall into this category. Gradient based optimisation has a related but slightly different sensitivity. When features have wildly different scales, the loss surface becomes elongated and stretched in certain directions, which makes gradient descent take a zigzagging path towards the minimum rather than a direct one. This slows convergence and can require far more careful tuning of the learning rate than would otherwise be necessary.

A worked example with gradient descent

Imagine fitting a simple linear regression using gradient descent, with one feature measured in the range zero to one and another measured in the range zero to ten thousand. A learning rate that works well for updating the weight on the small scale feature will likely be far too large for the weight on the large scale feature, causing that weight to oscillate or diverge, while a learning rate small enough to stabilise the large scale feature will make learning painfully slow for the small scale feature. You end up needing per feature learning rates or adaptive optimisers just to compensate for a problem that scaling would have solved directly.

Standardising both features to have a mean of zero and a standard deviation of one puts them on comparable footing. The loss surface becomes closer to circular rather than a long thin ellipse, and a single learning rate can work reasonably well for both weights simultaneously. This is why standardisation, or occasionally min max normalisation, is treated as close to mandatory before fitting neural networks, logistic regression trained by gradient descent, and support vector machines. For some of these the issue is not correctness in a strict mathematical sense, since an unregularised linear or logistic regression has, in principle, the same optimal predictions whatever the units. The issue is making optimisation tractable and efficient in a reasonable number of iterations.
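To make the learning rate problem concrete, here is a small sketch of batch gradient descent on a two feature linear regression, run on raw and on standardised inputs. The data, the learning rates, the tolerance, and all function names are assumptions chosen for illustration, not measurements from a real project.

```python
import math

def standardise(column):
    """Rescale a column to mean 0 and population standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) * (v - mean) for v in column) / len(column))
    return [(v - mean) / std for v in column]

def half_mse(b, w1, w2, x1, x2, y):
    """Half the mean squared error of the linear model on the data."""
    residuals = [b + w1 * a + w2 * c - t for a, c, t in zip(x1, x2, y)]
    return sum(r * r for r in residuals) / (2 * len(y))

def gradient_descent(x1, x2, y, lr, max_steps=2000, tol=1e-6):
    """Batch gradient descent with one shared learning rate.

    Returns (status, steps); status is 'converged', 'diverged' or 'max_steps'.
    """
    b = w1 = w2 = 0.0
    n = len(y)
    start_loss = half_mse(b, w1, w2, x1, x2, y)
    for step in range(1, max_steps + 1):
        residuals = [b + w1 * a + w2 * c - t for a, c, t in zip(x1, x2, y)]
        grad_b = sum(residuals) / n
        grad_w1 = sum(r * a for r, a in zip(residuals, x1)) / n
        grad_w2 = sum(r * c for r, c in zip(residuals, x2)) / n
        b, w1, w2 = b - lr * grad_b, w1 - lr * grad_w1, w2 - lr * grad_w2
        loss = half_mse(b, w1, w2, x1, x2, y)
        # Treat a non-finite or exploding loss as divergence.
        if not math.isfinite(loss) or loss > 1e6 * start_loss:
            return "diverged", step
        if loss < tol:
            return "converged", step
    return "max_steps", max_steps

# Hypothetical noise-free data on a 5 x 5 grid: x1 in [0, 1], x2 in [0, 10000].
grid = [(i / 4, j * 2500.0) for i in range(5) for j in range(5)]
x1 = [a for a, _ in grid]
x2 = [c for _, c in grid]
y = [1.0 + 3.0 * a + 0.002 * c for a, c in grid]

raw_fast = gradient_descent(x1, x2, y, lr=0.1)    # too large for the big feature
raw_slow = gradient_descent(x1, x2, y, lr=1e-8)   # stable, but barely moves
scaled = gradient_descent(standardise(x1), standardise(x2), y, lr=0.1)

print("raw, lr=0.1:   ", raw_fast)
print("raw, lr=1e-8:  ", raw_slow)
print("scaled, lr=0.1:", scaled)
```

On the raw features the larger rate blows up and the smaller one runs out of steps, while the same larger rate converges once both features are standardised. The exact step counts depend on the toy data and should not be read as typical.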

Regularised linear models add another layer to this argument. Ridge and lasso regression penalise the magnitude of coefficients directly. If one feature is on a much larger scale than another, its coefficient will naturally be smaller to produce the same effect on the prediction, so it attracts very little penalty. A feature on a small scale needs a numerically large coefficient to have an equivalent predictive effect, and the penalty shrinks it much harder. The amount of regularisation each feature receives therefore depends on its units rather than on its usefulness. Scaling before fitting a regularised linear model is not optional if you want the penalty to be applied fairly across features, and I would treat it as a near hard requirement in any serious pipeline.
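A one feature ridge regression is enough to see the effect. The sketch below uses the closed form solution without an intercept and fits the same hypothetical feature twice, once in years and once in days, with the same penalty strength. The data and the penalty value are invented for illustration.

```python
def ridge_slope(x, y, penalty):
    """Closed-form ridge solution for a single feature with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)

years = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]          # exactly 2 per year, so the unpenalised slope is 2
days = [365.0 * v for v in years]      # same information, larger numbers

penalty = 30.0
slope_years = ridge_slope(years, target, penalty)
slope_days = ridge_slope(days, target, penalty)

# Express both fits as "effect per year" so they can be compared.
effect_years = slope_years
effect_days = slope_days * 365.0

print(effect_years)  # 1.0, half of the unpenalised slope
print(effect_days)   # very close to 2, barely shrunk at all
```

The penalty halves the effect when the feature is recorded in years and leaves it almost untouched when it is recorded in days, which is the unfairness described above.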

Why tree based models genuinely do not care

Now consider the other side of the divide. Decision trees, random forests, and gradient boosted trees make splits by asking questions like whether a feature value is above or below some threshold. Crucially, the specific numeric value of the threshold does not matter for the structure of the tree; only the ordering of values matters. If you take a feature and apply any strictly increasing transformation to it, such as multiplying by a positive constant, adding a constant, or taking a logarithm of positive values, the relative ordering of every data point along that feature is preserved exactly. The tree will find a threshold in the transformed space that partitions the training data identically to the one it would have found in the original space.

Take the income example again. A tree looking for a split on income might decide that customers earning above fifty two thousand pounds behave differently from those below that threshold. If you rescale income to be measured in thousands, the equivalent split simply becomes fifty two point zero rather than fifty two thousand. The tree makes exactly the same decision, produces exactly the same partition of the data, and achieves exactly the same reduction in impurity. Nothing about the model's predictive performance changes whatsoever.
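The same point can be checked with a tiny hand written split search. This is a sketch of a single Gini based split on made up incomes and labels, not the splitting routine of any particular library.

```python
import math

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, left_indices) minimising the weighted Gini impurity."""
    ordered = sorted(set(values))
    best = None
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]

# Hypothetical customers: income in pounds and a binary behaviour label.
income = [21000, 35000, 48000, 51000, 53000, 60000, 75000, 90000]
label = [0, 0, 0, 0, 1, 1, 1, 1]

thr_pounds, left_pounds = best_split(income, label)
thr_thousands, left_thousands = best_split([v / 1000 for v in income], label)
thr_log, left_log = best_split([math.log(v) for v in income], label)

print(thr_pounds, left_pounds)        # 52000.0 [0, 1, 2, 3]
print(thr_thousands, left_thousands)  # 52.0 [0, 1, 2, 3]
print(left_log == left_pounds)        # True
```

The threshold changes with the units, but the set of customers sent to each side of the split does not.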

This property extends to the ensembles built on top of trees as well. Random forests and gradient boosted trees inherit the same scale invariance because they are built from the same threshold based splitting logic, just aggregated across many trees or fitted sequentially to residuals. This is a genuinely useful fact in practice: if you are building a gradient boosting model, you can leave skewed, unscaled, and even wildly heterogeneous features exactly as they are, and the model's ability to find useful splits will be unaffected. Time spent scaling features for a tree based pipeline is time that produces no measurable benefit.

Where naive Bayes and other probabilistic models sit

Naive Bayes classifiers occupy an interesting middle position that is worth spelling out because it often gets lumped in incorrectly with either the scale sensitive or scale invariant camp. Gaussian naive Bayes estimates a mean and variance for each feature within each class and then evaluates the likelihood of a new point under that fitted Gaussian distribution. Because the mean and variance are estimated directly from the data for each feature independently, and because the likelihood calculation for one feature does not interact numerically with the likelihood calculation for another feature in a way that involves comparing their raw magnitudes, scaling has essentially no effect on the classifier's decisions. The estimated distribution simply adapts to whatever scale the data happens to be on.

This differs from k nearest neighbours or clustering precisely because naive Bayes treats each feature independently rather than combining them into a single joint distance metric. There is no cross feature comparison of magnitude happening anywhere in the calculation, so there is no mechanism by which a large scale feature could dominate a small scale one. It is a useful example to keep in mind because it shows that the scale sensitivity split is not simply about whether a model is linear or probabilistic in nature, but specifically about whether the model's internal computation mixes magnitudes across features.
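A bare bones Gaussian naive Bayes shows this behaviour. The sketch below assumes equal class priors and applies no variance smoothing, and the customers are invented; library implementations may add smoothing terms that behave slightly differently.

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate a mean and variance per feature within each class."""
    model = {}
    for c in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == c]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var))
        model[c] = stats
    return model

def predict(model, row):
    """Pick the class with the highest summed per-feature Gaussian log likelihood."""
    def log_likelihood(stats):
        return sum(-0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
                   for v, (mean, var) in zip(row, stats))
    return max(model, key=lambda c: log_likelihood(model[c]))

# Hypothetical training data: (income in pounds, years employed).
train = [(25000, 2), (30000, 5), (28000, 3), (35000, 4),
         (70000, 20), (80000, 15), (65000, 25), (90000, 18)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
queries = [(40000, 6), (60000, 12), (75000, 22)]

def rescale(row):
    """Same information in other units: income in thousands, employment in months."""
    return (row[0] / 1000, row[1] * 12)

raw_model = fit_gaussian_nb(train, labels)
scaled_model = fit_gaussian_nb([rescale(r) for r in train], labels)

raw_preds = [predict(raw_model, q) for q in queries]
scaled_preds = [predict(scaled_model, rescale(q)) for q in queries]

print(raw_preds)     # [0, 1, 1]
print(scaled_preds)  # [0, 1, 1]
```

Rescaling a feature shifts its log likelihood by the same constant for every class, so the comparison between classes is unchanged.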

Logistic regression, by contrast, does mix magnitudes across features because it computes a weighted sum of feature values before passing the result through a sigmoid function. This puts logistic regression firmly in the scale sensitive camp for the same reasons as linear regression, particularly when fitted with gradient based optimisation or when regularisation is applied. It is worth being precise about this distinction rather than assuming that all probabilistic classifiers behave the same way.

Practical consequences for pipeline design

The practical upshot is that scaling decisions should be driven by the specific model you are fitting, not applied as a blanket default across every project. If your pipeline includes k nearest neighbours, k means, support vector machines, principal component analysis, logistic regression fitted with regularisation, or any neural network, scaling before fitting is close to essential, and standardisation using training set statistics only is the standard approach to avoid leaking information from validation or test data into the transformation.
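Fitting the scaler on training data only can be sketched in a few lines. The function names and income values here are hypothetical; the point is simply that the test rows are transformed with the training mean and standard deviation rather than their own.

```python
import math

def fit_standardiser(train_column):
    """Compute the mean and population standard deviation from training data only."""
    mean = sum(train_column) / len(train_column)
    std = math.sqrt(sum((v - mean) ** 2 for v in train_column) / len(train_column))
    return mean, std

def apply_standardiser(column, mean, std):
    """Transform any column using previously fitted statistics."""
    return [(v - mean) / std for v in column]

train_income = [20000.0, 40000.0, 60000.0, 80000.0]
test_income = [50000.0, 100000.0]

mean, std = fit_standardiser(train_income)           # statistics from training rows only
train_scaled = apply_standardiser(train_income, mean, std)
test_scaled = apply_standardiser(test_income, mean, std)  # reuse, never refit

print(mean)         # 50000.0
print(test_scaled)  # first value is 0.0, second is about 2.236
```

The test values are allowed to fall outside the range seen in training; refitting the scaler on them would leak information about the held out data into the transformation.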

If your pipeline is built entirely around decision trees, random forests, or gradient boosted trees, you can save yourself the effort and the added complexity of maintaining a scaler object, since it will not change the model's predictions at all. This matters more than it sounds in production settings, where every additional transformation step is another thing that can go wrong, another artefact that needs to be versioned alongside the model, and another point of failure if the scaler is accidentally fitted on the wrong subset of data.

There is also a subtlety worth mentioning for pipelines that combine multiple model types, such as an ensemble mixing a gradient boosted tree with a logistic regression or a neural network. In these cases you generally need to scale the features for the scale sensitive components while leaving the tree based component to work on whatever representation suits it best, which sometimes means maintaining two versions of the feature set or applying the scaler only within the branch of the pipeline that needs it. Being explicit about which model needs what, rather than applying a single global transformation and hoping it works for everything, is the more rigorous approach and tends to produce more predictable, more debuggable systems.

The takeaway

Feature scaling is not a universal best practice to apply reflexively before every model. It is a targeted fix for a specific mathematical property, namely that some models combine raw feature magnitudes across dimensions in ways that make the choice of units matter, while others rely purely on the ordering of values within each feature and are therefore completely indifferent to scale. Distance based methods, gradient based optimisation, and regularised linear models fall into the sensitive category. Tree based models and, for different reasons, naive Bayes classifiers fall into the indifferent category.

The habit worth building is not to scale everything automatically, but to ask what your specific model does internally with the numbers you feed it, and to let that answer determine your preprocessing steps. It is a small piece of reasoning, but it saves wasted effort, avoids subtle bugs in regularisation and optimisation, and generally makes for a cleaner, more defensible pipeline that you can explain properly if someone asks why you made the choices you did.
