Seven-class classification of objects cropped from UAV video — benchmarking classical features against CNNs, and quantifying exactly how much a naïve train/test split inflates the numbers.

The task was to classify 8,903 object crops (car, bus, truck, van, person, bicycle, motor) extracted from a 146-frame UAV sequence. Two things make it hard: severe class imbalance (car ≈ 42% of samples, bicycle ≈ 1%), and the fact that crops of the same object track across consecutive frames are near-duplicates. Split those randomly and the test set is full of objects the model already saw in training, so the score looks great and means nothing.
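
A track-aware split can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function name `track_aware_split`, the `test_fraction` parameter, and the toy track IDs are all assumptions made for the example. The idea is simply that whole tracks, never individual crops, are assigned to one side of the split.

```python
import random
from collections import defaultdict


def track_aware_split(track_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no track appears in both train and test.

    track_ids[i] is the (hypothetical) track identifier of crop i.
    Whole tracks are moved into the test set until it holds roughly
    test_fraction of all crops; the remaining tracks go to training.
    """
    by_track = defaultdict(list)
    for idx, track in enumerate(track_ids):
        by_track[track].append(idx)

    tracks = sorted(by_track)  # deterministic order before shuffling
    random.Random(seed).shuffle(tracks)

    target = test_fraction * len(track_ids)
    train_idx, test_idx = [], []
    for track in tracks:
        # Fill the test set first, one whole track at a time.
        bucket = test_idx if len(test_idx) < target else train_idx
        bucket.extend(by_track[track])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 5 tracks, each seen in several consecutive frames.
track_ids = (["car_1"] * 6 + ["car_2"] * 4 + ["bus_1"] * 5
             + ["person_1"] * 3 + ["van_1"] * 2)
train_idx, test_idx = track_aware_split(track_ids, test_fraction=0.3)

train_tracks = {track_ids[i] for i in train_idx}
test_tracks = {track_ids[i] for i in test_idx}
assert train_tracks.isdisjoint(test_tracks)  # no track straddles the split
```

This sketch ignores class balance when choosing tracks. With scikit-learn available, `GroupShuffleSplit` or `StratifiedGroupKFold` from `sklearn.model_selection` offer the same grouping guarantee, the latter while also trying to preserve class proportions.
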

Held-out results under the track-aware split:

| Model | Accuracy | Macro-F1 |
|---|---|---|
| HOG + Linear SVM | 0.776 | 0.402 |
| HOG + RBF SVM | 0.824 | 0.359 |
| HOG + Random Forest | 0.803 | 0.256 |
| MobileNetV2 | 0.851 | 0.480 |
| ResNet-18 | 0.868 | 0.462 |
| EfficientNet-B0 | 0.876 | 0.460 |
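
CNNs clearly beat the classical baselines on macro-F1 (0.460 to 0.480, against 0.402 for the best classical model). The imbalance hurts HOG + Random Forest most (0.256), where minority classes collapse. Note that accuracy and macro-F1 rank the three classical models in opposite order, which is why accuracy alone is misleading here. Among the CNNs the three backbones are close: EfficientNet-B0 has the best accuracy (0.876), MobileNetV2 the best macro-F1 (0.480).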
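
The gap between accuracy and macro-F1 is easy to reproduce on toy data. The sketch below is illustrative only: the label counts are made up, and the helper names `accuracy` and `macro_f1` are defined here for the example. It shows a degenerate classifier that always predicts the majority class scoring high accuracy and low macro-F1.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 90 cars, 8 vans, 2 bicycles; the model predicts "car" for everything.
y_true = ["car"] * 90 + ["van"] * 8 + ["bicycle"] * 2
y_pred = ["car"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.900
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # macro-F1 = 0.316
```

Every class counts equally in macro-F1, so two classes with an F1 of zero pull the average down to about a third even though nine predictions in ten are correct. In practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.
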




The leakage analysis was the real contribution. Under a naïve random split, the HOG + Random Forest model scored 0.906 accuracy / 0.776 macro-F1. The exact same model under a correct track-aware group split dropped to 0.803 / 0.256: a macro-F1 drop of 0.52, or about two thirds of the leaky score. Most of the apparent "performance" was the model recognising near-duplicate crops it had already trained on.
| Split | Accuracy | Macro-F1 |
|---|---|---|
| Naïve random (leaky) | 0.906 | 0.776 |
| Track-aware (honest) | 0.803 | 0.256 |

The lesson generalises well beyond this dataset: the split protocol can matter more than the model.
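
One way to audit a split for this kind of leakage is to measure how many test crops belong to a track that also appears in training. The following is a hedged sketch on synthetic data: `leaked_fraction` and the `track_N` identifiers are hypothetical names introduced for the example, not part of the project.

```python
import random


def leaked_fraction(track_ids, train_idx, test_idx):
    """Fraction of test crops whose track also contributes crops to training."""
    train_tracks = {track_ids[i] for i in train_idx}
    leaked = sum(track_ids[i] in train_tracks for i in test_idx)
    return leaked / len(test_idx)


# Toy data: 20 tracks of 10 near-duplicate crops each.
track_ids = [f"track_{t}" for t in range(20) for _ in range(10)]
indices = list(range(len(track_ids)))

# Naive split: shuffle individual crops, hold out the last 20%.
random.Random(0).shuffle(indices)
cut = int(0.8 * len(indices))
naive_train, naive_test = indices[:cut], indices[cut:]

# Track-aware split: hold out the last 4 tracks in full.
held_out = {f"track_{t}" for t in range(16, 20)}
group_train = [i for i, t in enumerate(track_ids) if t not in held_out]
group_test = [i for i, t in enumerate(track_ids) if t in held_out]

print("naive leak:", leaked_fraction(track_ids, naive_train, naive_test))
print("group leak:", leaked_fraction(track_ids, group_train, group_test))
```

With tracks this long, a crop-level shuffle leaves almost every test crop with siblings in the training set, while the track-aware split leaks none by construction.
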

Two controlled comparisons were run on ResNet-18. First, transfer learning beat training from scratch (0.868 vs 0.820 accuracy), confirming the value of ImageNet features on a small dataset. Second, explicit imbalance handling traded a little accuracy for more balanced per-class behaviour.
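
The write-up does not say which imbalance-handling technique was used. As one common option, and purely as an assumption for illustration, inverse-frequency class weights can be computed as w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The label counts below are invented for the example and are not the project's real distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical toy label distribution (not the project's real counts).
labels = ["car"] * 84 + ["van"] * 12 + ["bicycle"] * 4
weights = balanced_class_weights(labels)

for name, w in sorted(weights.items()):
    print(f"{name:8s} {w:.3f}")
```

Rare classes receive weights above 1 and the dominant class a weight below 1, so errors on minority classes cost more during training. Weights like these could be passed to a weighted loss such as `torch.nn.CrossEntropyLoss(weight=...)`, which is the usual route to the trade-off described above: slightly lower overall accuracy in exchange for better minority-class recall.
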

