Evaluation · 9 min read · 3 Jul 2026

Class Weights vs Resampling for Imbalanced Data

Two common fixes for imbalanced classification solve different problems. Confusing them leads to models that look better on paper and worse in production.

Why imbalance keeps tripping people up

Almost every practitioner meets class imbalance early, usually through a model that reports ninety-eight percent accuracy while never once correctly flagging the rare class. The instinct is to reach for a fix, and the two most common fixes are class weighting and resampling. They are often presented as interchangeable, sometimes even combined without much thought, but they intervene at different points in the learning process and they carry different risks. Treating them as the same trick is how you end up with a model that performs beautifully on a validation set and falls apart the moment it meets real, unseen data.

I want to be precise about what each method actually does, because the vague version, that both methods just make the model pay more attention to the minority class, is true in spirit but useless in practice. Understanding the mechanism is what lets you choose sensibly and, just as importantly, evaluate the result without fooling yourself.

This post works through a small worked example, walks through the mechanics of both approaches, and spends real time on the evaluation traps that make imbalance work so easy to get wrong. The goal is not to declare a universal winner. It is to give you a way of reasoning about the choice that holds up regardless of the dataset in front of you.

A concrete imbalance to anchor the discussion

Suppose you have ten thousand transactions, of which two hundred are fraudulent. That is a two percent positive rate, a fairly typical level of imbalance for fraud, churn, or rare disease detection tasks. A model that predicts the majority class every single time achieves ninety-eight percent accuracy and is completely useless. This is the baseline you are implicitly competing against, and it is worth writing it down before you start, because it anchors every later number.
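Writing the baseline down takes only a few lines. This is a minimal sketch in plain Python using the counts from the example above; the variable names are illustrative only.

```python
# Counts from the worked example: 10,000 transactions, 200 fraudulent.
n_total = 10_000
n_fraud = 200
n_legit = n_total - n_fraud

# A "model" that always predicts the majority (legitimate) class.
true_positives = 0          # it never flags anything
false_negatives = n_fraud   # so every fraud case is missed
true_negatives = n_legit    # and every legitimate transaction is passed

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2%}, fraud recall = {recall:.2%}")
# prints: accuracy = 98.00%, fraud recall = 0.00%
```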

Now imagine you train a plain logistic regression on this data with no adjustment at all. The loss function is dominated by the nine thousand eight hundred negative examples, and the model has very little incentive to get the two hundred positives right, since misclassifying all of them barely moves the average loss. The result is typically a model with high precision on the rare class, if it predicts positive at all, but very low recall, meaning it misses most of the fraud cases entirely.

This is the starting point that both class weighting and resampling try to correct. They differ in how they correct it, and that difference matters more than it first appears.

How class weighting actually works

Class weighting changes the loss function, not the data. In our example, you might assign the positive class a weight of roughly forty-nine, matching the ratio of negatives to positives (9,800 to 200), or you might tune the weight as a hyperparameter. Every time the model makes a mistake on a fraud case, the loss for that example is multiplied by the weight before being added to the total. The negative examples are untouched. The optimiser now has a strong incentive to get the rare class right, because errors there cost far more.
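To make the mechanism concrete, here is a minimal sketch of a weighted binary log loss in plain Python. The function name `weighted_log_loss` and the toy batch are my own illustration, not any library's API, and real libraries differ in details such as how the weighted sum is normalised.

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Mean binary cross-entropy, with the loss on positives scaled by pos_weight.

    Assumption: the total is divided by the number of examples; some
    libraries divide by the sum of the weights instead.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -pos_weight * math.log(p)   # weighted positive term
        else:
            total += -math.log(1.0 - p)          # negatives are untouched
    return total / len(y_true)

# Ratio of negatives to positives in the worked example.
pos_weight = 9_800 / 200  # 49.0

# Toy batch: 49 legitimate transactions and one fraud case, all scored at
# 0.02 (the base rate). The model is nearly right about every negative and
# badly wrong about the single positive.
y_true = [0] * 49 + [1]
p_pred = [0.02] * 50

unweighted = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, pos_weight=pos_weight)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

In the unweighted case the missed fraud case barely registers in the average. With the weight applied, it dominates the loss, which is exactly the change in incentive described above.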

The appeal of this approach is that the training set is left exactly as it was. Every example is seen exactly once per epoch, no example is duplicated, and no synthetic data is introduced. This matters for anything that relies on the observed joint distribution of the features, such as models that estimate feature interactions. One caveat: weighting still shifts predicted probabilities towards the minority class, much as oversampling does, so if you need calibrated probabilities, check calibration on unweighted held-out data. Class weighting also tends to be cheap: for most standard classifiers and neural network loss functions, applying a weight is a one-line change with negligible additional computation or memory cost.

The main limitation is that class weighting only reshapes the gradient signal; it does not give the model any new examples to learn from. If your two hundred fraud cases do not adequately represent the diversity of fraud patterns that exist in the wider population, weighting them more heavily will make the model overfit harder to those two hundred specific cases rather than generalising to the pattern of fraud in general. Weighting amplifies whatever signal already exists in the minority class; it cannot manufacture signal that was never there.

How resampling actually works

Resampling changes the data, not the loss function. Oversampling duplicates or synthesises minority examples until the classes are more balanced; undersampling removes majority examples for the same purpose. Continuing the fraud example, random oversampling might repeat the two hundred fraud cases until each appears roughly forty-nine times on average, or a synthetic method might generate new points by interpolating between existing minority examples in feature space, producing plausible but artificial fraud cases that were never actually observed.
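Random oversampling is simple enough to sketch with the standard library. The helper `random_oversample` below is a hypothetical illustration, and integer row ids stand in for real feature vectors.

```python
import random

def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate minority rows (sampling with replacement) until classes match."""
    rng = random.Random(seed)
    pairs = list(zip(rows, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    n_extra = max(len(majority) - len(minority), 0)
    resampled = majority + minority + rng.choices(minority, k=n_extra)
    rng.shuffle(resampled)
    new_rows = [row for row, _ in resampled]
    new_labels = [label for _, label in resampled]
    return new_rows, new_labels

# Stand-in data: row ids 0..9999, the first 200 labelled as fraud.
rows = list(range(10_000))
labels = [1] * 200 + [0] * 9_800

new_rows, new_labels = random_oversample(rows, labels)
n_fraud = sum(new_labels)
n_legit = len(new_labels) - n_fraud
print(n_fraud, n_legit)  # prints: 9800 9800
```

The 9,800 fraud rows in the result are still drawn from only 200 distinct cases, so each original case appears about forty-nine times on average. No new information has been added.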

Undersampling instead reduces the nine thousand eight hundred legitimate transactions down to something closer to two hundred, throwing away the vast majority of the majority class. This is fast and simple, and it can work well when the majority class is highly redundant, meaning many of those transactions look nearly identical and little information is lost by discarding most of them. The obvious cost is that you are discarding data, and if the majority class actually contains useful diversity, for instance several distinct legitimate spending patterns, undersampling can quietly erase that diversity and hurt the model's ability to correctly identify legitimate transactions.

Synthetic oversampling methods try to avoid the crude duplication problem by generating new points rather than copying existing ones, interpolating between a minority example and one of its nearest minority neighbours. This can genuinely help in low-dimensional, well-behaved feature spaces. It becomes considerably more fragile in high-dimensional or highly categorical feature spaces, where interpolating between two points can produce a synthetic example that does not correspond to anything plausible, effectively adding noise rather than signal. This is a subtlety that gets lost in a lot of introductory material, which tends to present synthetic oversampling as a free upgrade over random duplication rather than a technique with its own failure modes.
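The interpolation step can be sketched in a few lines. This is a simplified, SMOTE-style illustration under my own assumptions (Euclidean distance, purely numeric features, a hypothetical helper named `interpolate_minority`), not a reference implementation of any published method or library.

```python
import math
import random

def interpolate_minority(points, k=3, n_new=5, seed=0):
    """Create synthetic points on the segment between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(points)
        # k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (p for p in points if p is not x),
            key=lambda p: math.dist(p, x),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(xi + u * (ni - xi) for xi, ni in zip(x, neighbour))
        )
    return synthetic

# Toy minority class in two hypothetical numeric features.
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (8.0, 9.0), (8.5, 9.5)]
new_points = interpolate_minority(fraud_points)
for point in new_points:
    print(point)
```

The fragility with categorical features follows directly from the arithmetic: if one coordinate were an integer category code, a point interpolated between codes 1 and 3 could land on 2.4, which corresponds to no category at all.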

The leakage trap that ruins most comparisons

Here is the mistake I see most often, and it is serious enough to invalidate an entire evaluation. If you oversample or apply synthetic generation before splitting into training and test sets, or before cross-validation folds are created, synthetic or duplicated points derived from a given minority example can end up in both the training fold and the test fold. The model is then effectively tested on near copies of examples it was trained on, and the reported recall or F1 score will look excellent while telling you almost nothing about generalisation to new fraud cases.

The correct order is to split first, then resample only the training portion, leaving the test set exactly as it was originally, with its natural imbalance intact. This is not a minor technicality. In my experience, this single ordering mistake accounts for a large share of the wildly optimistic imbalance results that later fail to reproduce on genuinely new data. If a colleague reports a large jump in recall from applying a resampling method, the first question worth asking is whether resampling happened before or after the split.
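The ordering can be sketched as follows. This is a minimal illustration with stand-in row ids and a hypothetical 80/20 random split; in practice a stratified split is usually the safer choice, and the same rule applies inside each cross-validation fold.

```python
import random

rng = random.Random(42)

# Stand-in dataset: (row_id, label) pairs, 200 fraud cases out of 10,000.
data = [(i, 1 if i < 200 else 0) for i in range(10_000)]
rng.shuffle(data)

# 1. Split first.
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# 2. Resample only the training portion (random oversampling here).
train_pos = [d for d in train if d[1] == 1]
train_neg = [d for d in train if d[1] == 0]
extra = rng.choices(train_pos, k=len(train_neg) - len(train_pos))
train_balanced = train_neg + train_pos + extra

# 3. The test set is untouched: it shares no rows with the training data
#    and keeps its natural imbalance.
train_ids = {row_id for row_id, _ in train_balanced}
test_ids = {row_id for row_id, _ in test}
assert train_ids.isdisjoint(test_ids)

test_pos_rate = sum(label for _, label in test) / len(test)
print(f"test positive rate: {test_pos_rate:.2%}")
```

Reversing steps 1 and 2 would make the disjointness check fail, because copies of the same fraud row would land on both sides of the split.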

Class weighting has a structural advantage here, precisely because it never touches the data. There is no analogous leakage risk from weighting, since the training set composition is unchanged; you only need to be careful that class weights are computed from the training fold's class distribution and not from the full dataset, which is a much smaller and easier mistake to avoid.
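Computing the weights from the training fold is a small amount of code. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`. The helper name and the fold sizes are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(train_labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(train_labels)
    n_samples = len(train_labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

# Hypothetical training fold: 8,000 rows, 160 of them fraud.
train_labels = [1] * 160 + [0] * 7_840

weights = balanced_class_weights(train_labels)
ratio = weights[1] / weights[0]
print(weights, ratio)
```

Only the ratio between the two weights affects the relative cost of errors. Here it comes out at 7,840 / 160 = 49, which matches the full-dataset ratio only because this hypothetical fold happens to preserve the class proportions.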

Choosing a metric before choosing a method

Neither class weighting nor resampling means anything if you evaluate the result with accuracy. In the fraud example, a model that predicts negative for everything scores ninety-eight percent accuracy on the untouched test set, so a weighted or resampled model that catches most of the fraud at the cost of some false alarms can easily score lower on accuracy than that useless baseline. Accuracy is dominated by the same imbalance that motivated the fix in the first place. Precision, recall, F1 score, and the area under the precision-recall curve are far more informative for rare-class problems, because they focus attention on how the model performs specifically on the class you actually care about.
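A small sketch shows how accuracy and the rare-class metrics can disagree. The confusion-matrix counts for the second model are invented for illustration and are not results from any experiment.

```python
def rare_class_metrics(tp, fp, fn, tn):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical test set: 2,000 transactions, 40 of them fraud.
always_negative = rare_class_metrics(tp=0, fp=0, fn=40, tn=1_960)
# Invented counts for a model that flags 90 transactions, 30 correctly.
adjusted = rare_class_metrics(tp=30, fp=60, fn=10, tn=1_900)

for name, m in (("always negative", always_negative), ("adjusted", adjusted)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

With these counts the adjusted model has lower accuracy than the trivial one while catching three quarters of the fraud, which is why accuracy is the wrong yardstick here.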

It also matters to decide, before running any experiment, whether you care more about recall, catching as much fraud as possible, or precision, minimising false alarms that waste investigator time. Class weighting lets you tune this trade-off directly and continuously, by adjusting the weight ratio and watching the precision-recall curve shift. Resampling offers a similar lever through the resampling ratio, but the relationship between resampling ratio and the eventual precision-recall trade-off is less direct and usually needs to be discovered empirically through repeated experiments rather than reasoned about in advance.

Whichever method you use, compare it against a genuine baseline: the same model, same features, same split, with no adjustment at all. It is surprisingly common for a well-tuned decision threshold on an unweighted, unresampled model to match or beat a more elaborate resampling pipeline, and you will never discover that if the unadjusted baseline is never run.
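Tuning the threshold on the unadjusted baseline is cheap to try. The sketch below uses invented scores to show the idea; in a real experiment the threshold should be chosen on a validation split, not on the test set.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall for the positive class at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented scores from a hypothetical unweighted model: most fraud cases
# score below 0.5 but still above most legitimate transactions.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.62, 0.35, 0.22, 0.08, 0.30, 0.12, 0.05, 0.04, 0.03, 0.02]

for threshold in (0.5, 0.2, 0.1):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With these scores, lowering the threshold from 0.5 to 0.2 lifts recall from 0.25 to 0.75 without touching the training procedure at all, which is the kind of result that only shows up if the baseline is actually run.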

A practical decision rule and closing thought

My working rule is this: start with class weighting, because it is cheap, leaves the data untouched, and is very fast to test across a range of weight ratios. If the results are unsatisfactory and you suspect the minority class genuinely lacks representative diversity rather than just being outvoted in the loss function, consider resampling as a second step, and be strict about doing it only after the split. Combining both is sometimes useful, but treat it as an additional experiment to justify rather than a default, since the two effects can stack in ways that are hard to interpret without careful ablation.

It is also worth remembering that imbalance handling is not always the right lever to pull at all. Sometimes the real problem is that the features available simply do not separate the classes well, and no amount of reweighting or resampling will manufacture separability that is not present in the feature space. In those cases, effort spent on better features or better labels will outperform any amount of tuning on the imbalance handling side.

The underlying lesson generalises well beyond this specific pair of techniques: any method that changes how a model learns should be understood in terms of what it actually alters, the data or the objective, and evaluated with a metric and a split that reflect the real decision the model will eventually have to make. Get that discipline right, and the choice between class weights and resampling becomes a straightforward, testable decision rather than a matter of habit or fashion.
