Deep Learning · 6 min read · 5 Jul 2026

Attention, Intuitively: Queries, Keys, and Values

A worked, no-nonsense walkthrough of why attention is really just a soft lookup, and why that framing makes transformers click.

Why this idea is worth slowing down for

Most people meet attention through a diagram: three matrices labelled Q, K, and V, an arrow into a softmax, and a promise that this is how transformers 'know what to look at'. The equation is short enough to memorise without ever understanding it, and I think that is exactly the trap. You can recite softmax(QK^T / sqrt(d_k))V, where d_k is the dimension of the key vectors, for months and still not have a working mental model of what a query or a key actually represents.

I want to build the idea from something ordinary: a lookup table. A dictionary maps a key to a value. You look up 'apple', you get its definition. That is a hard lookup: exact match or nothing. Attention is the same idea made soft. Instead of an exact key match, every query compares itself against every key, gets a similarity score for each, and then blends all the values together, weighted by those scores. Nothing mystical is happening; it is a weighted average where the weights are learned to depend on content rather than fixed position.
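
To make the contrast concrete, here is a minimal sketch in plain Python. The vocabulary, the two-dimensional key vectors, and the value vectors are all made up for illustration, and nothing here is learned; it only shows a hard lookup next to a soft one.

```python
import math

# Hard lookup: exact key match or nothing.
definitions = {"apple": "a fruit", "anvil": "a block of iron"}
print(definitions.get("apple"))   # exact match
print(definitions.get("appel"))   # a near miss still returns None

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to one."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Blend all values, weighted by how well the query matches each key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical toy data: keys and values are small vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.1]   # points mostly along the first key

weights, blended = soft_lookup(query, keys, values)
print(weights)   # largest weight on the first key, but none are zero
print(blended)   # a mixture of all three values
```

The hard lookup returns nothing for a near miss, while the soft lookup always returns something: a blend dominated by the best-matching key.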

This matters practically because so much of what transformers do well (resolving pronouns, aligning words across languages, deciding which earlier token is relevant to the current one) reduces to this one operation done many times in parallel. If you understand one attention head properly, you understand the mechanical core of the architecture. Everything else (the multiple heads, the feed-forward layers, the residual connections) is scaffolding around this soft lookup.

Building the intuition with a small worked example

Suppose we are processing the sentence 'the trophy did not fit in the suitcase because it was too big', and we want to work out what 'it' refers to. A human reader resolves 'it' to the trophy, since the trophy being too big is what stops it fitting. Each word has been turned into a vector by earlier layers. For 'it', the model produces a query vector: a question the word is asking, roughly 'what earlier thing might I be referring to?'. Every word in the sentence, including 'it' itself, also produces a key vector: an advertisement of what that word offers, roughly 'here is the kind of thing I am, so match me if I'm relevant'.

The attention score between 'it' and any other word is the dot product of the query for 'it' with the key for that word. A high dot product means the query and key point in a similar direction in the learned space, which the model has arranged to mean 'these are relevant to each other'. Say the dot product between the query for 'it' and the key for 'trophy' comes out at 8.2, while the dot product with the key for 'suitcase' comes out at 3.1, and with 'big' it is 2.4. After scaling and softmax, these might turn into weights of roughly 0.79 for trophy, 0.13 for suitcase, and small residual weights spread across everything else.
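
As a sanity check on that arithmetic, the sketch below runs the three quoted scores through scaling and softmax. It assumes a key dimension of 8, which the text does not specify, and it ignores every other word in the sentence, so the resulting weights land only in the same ballpark as the figures quoted above.

```python
import math

labels = ["trophy", "suitcase", "third word"]
raw_scores = [8.2, 3.1, 2.4]   # the three dot products quoted above
d_k = 8                        # assumed key dimension, chosen for illustration

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scaled = [s / math.sqrt(d_k) for s in raw_scores]
weights = softmax(scaled)               # with the 1 / sqrt(d_k) scaling
unscaled_weights = softmax(raw_scores)  # the same scores without scaling

for label, w, u in zip(labels, weights, unscaled_weights):
    print(f"{label:>10}: scaled {w:.2f}, unscaled {u:.2f}")
```

With scaling, 'trophy' takes roughly three quarters of the weight and the other two words keep a visible share. Without it, the same scores give 'trophy' about 0.99, which is already close to a hard selection.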

Now comes the value step, which is the part people often gloss over. Each word also has a value vector: the actual content it contributes if selected. The output for 'it' is not the trophy's key, and it is not a hard selection of one word; it is a weighted sum of all the value vectors, using those softmax weights. So the new representation of 'it' becomes roughly 79 per cent trophy's content, 13 per cent suitcase's content, plus small contributions from everywhere else. That blended vector then carries forward into later layers, effectively saying 'this token is mostly about the trophy, with a little context from the suitcase'.
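
The blend can be written out directly. In this sketch the three-dimensional value vectors are invented, and all the small leftover weights are lumped into a single 'rest' entry with weight 0.08 so that the weights sum to one; both are assumptions made purely for illustration.

```python
# Hypothetical value vectors; real ones are learned and much higher-dimensional.
values = {
    "trophy":   [1.0, 0.0, 0.0],
    "suitcase": [0.0, 1.0, 0.0],
    "rest":     [0.0, 0.0, 1.0],   # stand-in for every other word combined
}
weights = {"trophy": 0.79, "suitcase": 0.13, "rest": 0.08}

dim = 3

# Soft selection: a weighted sum over every value vector.
output_for_it = [
    sum(weights[word] * values[word][i] for word in values)
    for i in range(dim)
]

# Hard selection, for contrast: keep only the best-matching word's value.
best_word = max(weights, key=weights.get)
hard_choice = values[best_word]

print(output_for_it)   # a blend: mostly trophy, some suitcase, a little of the rest
print(hard_choice)     # only the trophy's value, everything else discarded
```

Because the toy values are one-hot, the blended output simply mirrors the weights, which makes it easy to see how much of each word survives in the new representation of 'it'.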

The reason we need three separate projections (query, key, and value), rather than reusing the same vector for all three roles, is that a word's role in asking a question is different from its role in being matched against, which is different again from its role in what it contributes once matched. Learning three separate linear projections lets the model specialise each role rather than forcing one vector to do three jobs badly.
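
Putting the three roles together, a single attention head can be sketched in a few lines of plain Python. The matrix names `w_q`, `w_k`, and `w_v`, the toy sizes, and the random weights standing in for learned ones are all assumptions for illustration, not a reference implementation.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    q = matmul(x, w_q)   # what each token is asking
    k = matmul(x, w_k)   # what each token advertises
    v = matmul(x, w_v)   # what each token contributes if matched
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)   # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random stand-in for a learned matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, d_k = 4, 6, 3   # toy sizes, chosen arbitrarily
x = random_matrix(n_tokens, d_model)
w_q, w_k, w_v = (random_matrix(d_model, d_k) for _ in range(3))

output, weights = attention_head(x, w_q, w_k, w_v)
print(len(output), len(output[0]))   # one d_k-sized vector per token
```

The same input `x` goes through three different matrices, so the vector a token uses to ask, the vector it uses to be found, and the vector it hands over are free to differ.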

Why the scaling and the softmax are not decorative

The division by the square root of the key dimension looks like a minor implementation detail, but it earns its place. As vector dimensionality grows, dot products between random vectors tend to grow in magnitude too, purely as a statistical artefact of summing more terms: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so its typical size grows like sqrt(d_k). If you feed large, unscaled dot products into a softmax, the output becomes extremely peaked, effectively collapsing to a near one-hot selection even when several keys are genuinely relevant. Dividing by sqrt(d_k) brings that spread back to roughly one, which keeps the pre-softmax scores in a range where the softmax can express genuine uncertainty rather than being forced into overconfident, near-binary choices.
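
The growth is easy to check empirically. This sketch assumes query and key components drawn independently from a standard normal distribution, which is an idealisation rather than what a trained model produces, and measures the spread of their dot products at a few dimensions.

```python
import math
import random

rng = random.Random(0)

def dot_product_std(dim, trials=1000):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / trials)

results = {}
for dim in (16, 64, 256):
    raw = dot_product_std(dim)
    scaled = raw / math.sqrt(dim)   # spread after the 1 / sqrt(d_k) scaling
    results[dim] = (raw, scaled)
    print(f"dim={dim:>3}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

The raw spread climbs with the dimension while the scaled spread stays close to one, which is the whole job of the division.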

The softmax itself is doing something specific: converting arbitrary real-valued similarity scores into a proper probability distribution that sums to one, so the values can be combined as a weighted average rather than an unbounded sum. This is why attention weights are interpretable, loosely, as 'how much of my output comes from this position', even though I would be cautious about over-reading them as a full explanation of model behaviour; a high attention weight tells you about the blending, not necessarily about causal importance downstream.

It is also worth being honest about what this mechanism does not give you for free. Attention has no inherent sense of order; the trophy and suitcase example only works because positional information has been injected elsewhere, since the raw query-key-value operation treats the input as an unordered set: shuffle the input tokens and the outputs are shuffled in exactly the same way, with nothing else changed. And multiple heads exist because a single head produces only one set of attention weights per query, so it can express only one pattern of relevance at a time; running several in parallel with different learned projections lets the model track, say, syntactic relationships in one head and coreference-like patterns in another, then concatenate the results.
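
The lack of order is simple to demonstrate. The sketch below uses bare self-attention with identity projections (an assumption made for brevity; fixed learned projections behave the same way because they act on each token separately) and no positional signal, then compares the outputs before and after shuffling the tokens.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(x):
    """Bare self-attention: queries, keys, and values are the inputs themselves."""
    d = len(x[0])
    out = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
order = [3, 0, 4, 1, 2]   # an arbitrary shuffle of the five positions

shuffled_tokens = [tokens[i] for i in order]
out_original = attend(tokens)
out_shuffled = attend(shuffled_tokens)

# Position i of the shuffled run holds the token that sat at order[i] originally,
# so its output should match the original output at order[i].
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, j in enumerate(order)
    for a, b in zip(out_shuffled[i], out_original[j])
)
print(same)
```

Each token ends up with the same output vector wherever it sits in the sequence, so any sensitivity to word order has to come from positional information added to the inputs or to the scores.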

The practical takeaway

If you are building or debugging anything transformer-based, the query-key-value framing gives you a genuinely useful diagnostic habit. When a model attends to the wrong thing, ask which of the three roles is likely at fault: is the query poorly formed because the token's own representation is impoverished, are the keys not discriminative enough to separate relevant from irrelevant context, or are the values themselves not carrying the content that downstream layers need? These are different failure modes with different fixes, and collapsing them into one vague notion of 'attention isn't working' will not get you very far.

The other lesson I keep coming back to is that attention weights are a useful diagnostic, not a full explanation. Treat visualised attention maps as a hint about where information is flowing, worth cross-checking against ablations or probing experiments, rather than as proof of what the model has 'understood'. Intuition built from a concrete example like the trophy and the suitcase will serve you better than the formula alone, because it forces you to be precise about what is being asked, what is being matched, and what is actually being carried forward.
