Embeddings 101: What Cosine Similarity Actually Measures
Cosine similarity is the default metric for comparing embeddings, but it quietly assumes a lot about your vectors. Here is what it really computes, worked through with numbers.
Why this matters more than it looks
Cosine similarity turns up everywhere embeddings are used: semantic search, recommendation, clustering, deduplication, retrieval-augmented generation. It is treated as a neutral, almost mathematical fact of life, the default choice you reach for without thinking. That is precisely why it is worth interrogating. Most people can state the formula but cannot say, in plain terms, what it actually measures about two vectors. That gap tends to surface later as a confusing bug: two documents that feel semantically unrelated score suspiciously high, or a model ranks a short, generic sentence above a longer, more specific one for reasons nobody can explain.
The short answer is that cosine similarity measures the angle between two vectors and deliberately throws away their length. That single design choice, ignoring magnitude, is either the right decision or the wrong one depending on what your embedding space actually encodes. If you do not know which, you are trusting a metric you have not verified. Given how much downstream behaviour depends on similarity scores, that is a risk worth removing early rather than discovering in production.
The mechanics, worked through with numbers
Cosine similarity between two vectors a and b is the dot product of the vectors divided by the product of their magnitudes. The dot product captures how much the vectors point in the same direction and how large they both are; dividing by the magnitudes cancels out the size and leaves only the direction. The result is a number between minus one and one, where one means the vectors point exactly the same way, zero means they are orthogonal, and minus one means they point in opposite directions.
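Written out for two n-dimensional vectors, the definition above takes the following form. The symbols are notation introduced here for illustration: θ is the angle between a and b, and a_i, b_i are their components. The expression is undefined if either vector has zero magnitude.

```latex
\[
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```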
Take a small concrete example. Suppose vector A is (4, 0) and vector B is (2, 0). Their dot product is 8, and their magnitudes are 4 and 2 respectively, so cosine similarity is 8 divided by 8, which equals 1. Despite B being half the length of A, the two vectors point in exactly the same direction, so cosine similarity says they are a perfect match. Now compare A with vector C at (1, 1). The dot product is 4, the magnitude of C is roughly 1.41, and the magnitude of A is 4, giving a cosine similarity of 4 divided by roughly 5.66, or about 0.71. C sits at 45 degrees to A, and that angle alone is what lowers the score; the fact that C's magnitude is smaller than that of either A or B plays no part.
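The same arithmetic can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name cosine_similarity is chosen here for illustration and is not taken from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# The three vectors from the worked example.
A = (4.0, 0.0)
B = (2.0, 0.0)
C = (1.0, 1.0)

sim_ab = cosine_similarity(A, B)  # same direction, different length
sim_ac = cosine_similarity(A, C)  # 45 degrees apart

print(f"A vs B: {sim_ab:.2f}")  # A vs B: 1.00
print(f"A vs C: {sim_ac:.2f}")  # A vs C: 0.71
```

Scaling any of the inputs by a positive constant leaves both results unchanged, which is the scale-blindness in action.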
This illustrates the core behaviour precisely: cosine similarity is blind to scale and sensitive only to orientation. Two embeddings that point in identical directions will score as maximally similar regardless of whether one represents a short fragment and the other a long, information-dense passage. In some embedding models, magnitude does carry a signal, sometimes correlating with how much content or confidence went into producing the vector, and cosine similarity throws that signal away by construction. That is not exactly a flaw; it is a deliberate simplification, but one you need to be aware you are making.
Why the magnitude assumption can bite you
Consider a retrieval system built on sentence embeddings, where you embed a user query and rank documents by cosine similarity to that query vector. If the embedding model happens to produce larger-magnitude vectors for more generic, high-frequency phrasing, and smaller-magnitude vectors for rare, specific terminology, cosine similarity will not penalise the generic match for being generic; it only cares about angular alignment. So a vague, broadly worded document can end up sitting at nearly the same angle as a highly specific one relative to the query, and cosine similarity reports them as equally relevant even though a human would clearly prefer the specific match. Because the metric has already discarded magnitude, it can express that preference only if the directions themselves differ.
Another practical wrinkle appears when vectors are not zero-centred. If most embeddings in your space share a common dominant direction, perhaps because of how a particular architecture was trained, cosine similarity between almost any two vectors can be inflated toward a high positive value, since they are all leaning the same way relative to the origin. This is sometimes called anisotropy, and it means that a cosine score of 0.85 might not mean nearly as much as it appears to if the average cosine similarity between two unrelated vectors in that space is already 0.7. Without checking the baseline distribution of scores for your specific embedding model, a single similarity number is close to meaningless. Two numbers you should always look at before trusting cosine similarity in a new setting are the similarity between clearly related pairs and the similarity between clearly unrelated pairs, so you know where the useful range actually sits.
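A baseline check of this kind can be sketched as follows. The vectors are made-up toy data in which every vector shares a large first component, standing in for a common dominant direction; baseline_report and the pair lists are hypothetical names introduced for illustration. In practice the pairs would come from examples you have labelled yourself.

```python
import math
from statistics import mean

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def baseline_report(vectors, related_pairs, unrelated_pairs):
    """Mean cosine similarity for known-related and known-unrelated pairs."""
    related = mean(cosine_similarity(vectors[i], vectors[j])
                   for i, j in related_pairs)
    unrelated = mean(cosine_similarity(vectors[i], vectors[j])
                     for i, j in unrelated_pairs)
    return {"related": related, "unrelated": unrelated,
            "gap": related - unrelated}

# Toy data (assumption): a shared offset of 10 along the first axis
# dominates every vector, so all of them lean the same way.
vectors = [
    [10.0, 1.0, 0.0],
    [10.0, 0.9, 0.1],   # labelled as related to vector 0
    [10.0, -1.0, 0.0],  # labelled as unrelated to vectors 0 and 1
    [10.0, 0.0, -1.0],  # labelled as unrelated to vector 0
]

report = baseline_report(
    vectors,
    related_pairs=[(0, 1)],
    unrelated_pairs=[(0, 2), (0, 3), (1, 2)],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy space even the unrelated pairs score above 0.98, so the usable range is the narrow gap between the two means rather than the full interval from minus one to one.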
This has direct consequences for evaluation. If you are comparing two embedding models by looking at raw cosine similarity scores, you are implicitly assuming both models have a similar baseline distribution, which is rarely checked and rarely true. A model with heavily anisotropic embeddings can produce systematically higher cosine scores across the board without actually encoding better semantic structure. The honest fix is to evaluate on a task with ground truth, such as ranking accuracy or retrieval precision at a fixed cut-off, rather than eyeballing similarity magnitudes in isolation.
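Precision at a fixed cut-off is simple to compute once you have ground-truth relevance labels. The sketch below assumes a toy query, four hypothetical documents (d1 to d4) and a hand-labelled relevant set; none of these come from a real dataset, and the helper names are illustrative.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank_by_cosine(query, docs):
    """Document ids sorted from most to least similar to the query."""
    return sorted(docs,
                  key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
                  reverse=True)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked ids that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

# Toy data (assumption): 2-d embeddings and hand-assigned labels.
query = (1.0, 0.0)
docs = {
    "d1": (0.9, 0.1),
    "d2": (0.5, 0.5),
    "d3": (0.0, 1.0),
    "d4": (0.8, -0.2),
}
relevant = {"d1", "d2"}

ranked = rank_by_cosine(query, docs)
p_at_2 = precision_at_k(ranked, relevant, k=2)

print(ranked)   # ['d1', 'd4', 'd2', 'd3']
print(p_at_2)   # 0.5
```

Running the same labelled queries through two embedding models and comparing this number gives a like-for-like comparison that does not depend on either model's baseline score distribution.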
What to actually do about it
None of this means cosine similarity is a poor choice; it remains sensible whenever the meaningful information in your embeddings lives in direction rather than length, which is often the case for transformer-based sentence and word embeddings. The practical discipline is to stop treating it as a universal default and start treating it as an assumption you test. Cosine similarity is itself unchanged by normalising vectors to unit length, so the useful check is against a metric that does see magnitude: before relying on cosine similarity in a new pipeline, compare your downstream results under raw dot product (or Euclidean distance on unnormalised vectors) with those under cosine similarity. If they differ, magnitude is influencing the ranking, and you need to decide whether it carries information worth keeping before you discard it.
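One way to run that kind of check is to rank the same candidates twice, once by raw dot product, which is sensitive to magnitude, and once by cosine similarity, which is equivalent to a dot product on unit-length vectors, and see whether the order changes. The sketch below uses two made-up documents whose names (long_generic, short_specific) are purely illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

# Toy data (assumption): one large-magnitude vector slightly off-angle,
# one small-magnitude vector pointing exactly along the query.
query = (1.0, 1.0)
docs = {
    "long_generic": (6.0, 2.0),
    "short_specific": (1.0, 1.0),
}

by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
by_cosine = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                   reverse=True)
rankings_agree = by_dot == by_cosine

print("by dot product:", by_dot)     # ['long_generic', 'short_specific']
print("by cosine:     ", by_cosine)  # ['short_specific', 'long_generic']
print("rankings agree:", rankings_agree)  # False
```

A disagreement like this does not say which ranking is right; it only shows that magnitude matters in your space, which is the cue to find out what it correlates with.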
It is also worth comparing cosine similarity against Euclidean distance on normalised vectors, since for unit-length vectors the two are monotonically related and will rank pairs identically, which is a useful sanity check that you understand your own space. If your embeddings are not normalised and you have not checked whether their magnitudes correlate with something meaningful, such as sentence length, confidence, or frequency, you are one unverified assumption away from a subtle ranking bug. The fix costs very little: normalise, inspect the score distribution on known positive and negative pairs, and only then commit to cosine similarity as your metric of record.
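The monotonic relationship follows from expanding the squared distance: for unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity. The sketch below checks that identity numerically on a few arbitrary toy vectors and confirms that the two metrics produce the same ranking; all names and values are illustrative.

```python
import math

def normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy data (assumption): arbitrary 2-d vectors, normalised up front.
query = normalise([3.0, 4.0])
candidates = {
    "p": normalise([4.0, 3.0]),
    "q": normalise([0.0, 2.0]),
    "r": normalise([-1.0, 0.5]),
}

# On unit vectors the dot product is the cosine similarity, and
# squared distance = 2 - 2 * cosine.
for name, vec in candidates.items():
    cos = dot(query, vec)
    d2 = squared_euclidean(query, vec)
    assert math.isclose(d2, 2.0 - 2.0 * cos, abs_tol=1e-12)

# Highest cosine first versus smallest distance first.
by_cosine = sorted(candidates,
                   key=lambda n: dot(query, candidates[n]), reverse=True)
by_distance = sorted(candidates,
                     key=lambda n: squared_euclidean(query, candidates[n]))

print(by_cosine)    # ['p', 'q', 'r']
print(by_distance)  # ['p', 'q', 'r']
```

If the two orderings ever disagree on your own data, the vectors were not actually unit length when they were compared.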