Some Intuition on Attention and the Transformer
What's the big deal? Intuition on query-key-value vectors, multiple heads, multiple layers, and more.
ChatGPT and other chatbots (e.g., Bard, Claude) have thrust LLMs into the mainstream. As a result, more and more people outside ML and NLP circles are trying to grasp the concept of attention and the Transformer model. Here, we’ll address some questions and try to provide intuition on the Transformer architecture. The intended audience is people who have read the paper and have a basic understanding of how attention works.
To keep it simple, I’ll mostly refer to “words” in a “sentence”. Nonetheless, attention can apply to any generic set of items in a sequence. For example, instead of words, we could have tokens, events, or products. And instead of a sentence, we could have a paragraph, in-session behavior, or purchase history.
What’s the big deal about attention?
Consider machine translation as an example. Before attention, most translation was done via an encoder-decoder network. The encoder encodes the input sentence (“I love you”) via a recurrent model and the decoder decodes it into another language (“我爱你”).
Encoding an input sentence into a fixed-size vector for the decoder (source)
Via this approach, the encoder had to cram the entire input into a fixed-size vector which is then passed to the decoder—this single vector had to convey everything about the input sentence! Naturally, this led to an informational bottleneck. With attention, we no longer have to encode input sentences into a single vector. Instead, we let the decoder attend to different words in the input sentence at each step of output generation. This increases the informational capacity, from a single fixed-size vector to the entire sentence (of vectors).
Furthermore, previous recurrent models had long paths between input and output words. If you had a 50-word sentence, the decoder had to recall information from 50 steps ago for the first word (and that data had to be squeezed into a single vector). As a result, recurrent models had difficulty dealing with long-range dependencies. Attention addressed this by letting each step of the decoder see the entire input sentence and decide what words to attend to. This cut down path length and made it consistent across all steps in the decoder.
Finally, prior language models leaned heavily on a recurrent approach: To encode a sentence, we start with the first word (w1) and process it to get the first hidden state (h1). Then, we input the second word (w2) with the previous hidden state (h1) to derive the next hidden state (h2). And so on. Unfortunately, this process was sequential and prevented parallelization. Attention tackled this by reading the entire sentence in one go and computing the representation of each word, based on the sentence, in parallel.
What are query, key, and value vectors?
Imagine yourself in a library. You have a specific question (query). Books on the shelves have titles on their spines (keys) that suggest their content. You compare your question to these titles to decide how relevant each book is, and how much attention to give each book. Then, you get the information (value) from the relevant books to answer your question.
In attention, the query refers to the word we’re computing attention for. In the case of an encoder, the query vector points to the current input word (aka context). For example, if the context was the first word in the input sentence, it would have a query vector q1.
The keys represent the words in the input sentence. The first word has key vector k1, the second word has vector k2, and so on. The key vectors help the model understand how each word relates to the context word. If the first word is the context, we compare the keys to q1.
Attention is how much weight the query word (e.g., q1) should give each word in the sentence (e.g., k1, k2, etc.). This is computed via a dot product between the query vector and all the key vectors. (A dot product tells us how similar two vectors are.) If the dot product between a query-key pair is high, we pay more attention to it. In practice, the dot products are scaled by the square root of the key dimension to keep the softmax gradients stable. These scaled dot products then go through a softmax, which makes the attention scores (across all keys) sum to 1.
Each word is also represented by a value vector which contains the information of that word. These value vectors are weighted by the attention scores that sum to 1. As a result, each context word is now represented by an attention-weighted combination of all the words in the sentence, where the most relevant words have the highest weight.
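To make this concrete, here's a minimal NumPy sketch of scaled dot-product attention. It's illustrative only: the toy vectors are random, and a real Transformer would first produce Q, K, and V via learned projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # query-key similarity
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # attention-weighted mix of values

# Three words, each with a 4-dim query, key, and value vector (toy numbers).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per word
```

Note that the matrix form computes attention for all three query words at once, which is exactly the parallelism discussed earlier.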
What do the encoder and decoder do?
The encoder takes a text input, such as a sentence, and returns a sequence of embeddings. These output embeddings can then be used for classification, translation, semantic similarity, etc. Self-attention enables the encoder to weigh the importance of each word and capture both short and long-range dependencies.
In contrast, the decoder takes inputs such as a start-of-sentence token and (optional) embeddings from the encoder, and returns probabilities to select the next word. Self-attention enables the decoder to focus on different parts of the output generated so far; cross-attention (aka encoder-decoder attention) helps it attend to the encoder’s output.
How does the decoder generate words?
The decoder outputs a probability for the next word (i.e., every word in the vocabulary has an associated probability). Thus, we can generate the next word by greedily picking the word with the highest probability. Alternatively, we can apply beam search: keep the top n candidate sequences, extend each of them with possible next words, and finally select the sequence with the highest cumulative probability.
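Here's a small sketch of both strategies, assuming a hypothetical `probs_fn` that stands in for the decoder's next-word distribution. It shows one step of beam search; a real implementation would loop until an end-of-sentence token.

```python
import math

def greedy_pick(probs):
    """Pick the token id with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def beam_search_step(beams, probs_fn, beam_width):
    """Extend each beam with every candidate token, then keep the top
    `beam_width` beams by cumulative log-probability (summing logs
    avoids numerical underflow from multiplying small probabilities)."""
    candidates = []
    for tokens, logp in beams:
        probs = probs_fn(tokens)  # model's next-token distribution
        for tok, p in enumerate(probs):
            if p > 0:
                candidates.append((tokens + [tok], logp + math.log(p)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

# Toy "model": a fixed distribution over a 4-token vocabulary.
probs_fn = lambda tokens: [0.1, 0.5, 0.3, 0.1]
print(greedy_pick(probs_fn([])))        # 1 (the highest-probability token)
beams = beam_search_step([([], 0.0)], probs_fn, beam_width=2)
print([tokens for tokens, _ in beams])  # [[1], [2]]
```

With beam_width=1, beam search reduces to greedy decoding; wider beams trade compute for better overall sequences.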
Why have multiple attention heads?
Multiple heads let the model attend to multiple words simultaneously. Because attention uses the softmax function, it amplifies the highest score while squashing the lower ones. As a result, each head tends to focus on a single element.
Consider the sentence: “The chicken crossed the road carelessly”. The following words are relevant to “crossed” and should be attended to:
- The “chicken” is the subject doing the crossing.
- The “road” is the object being crossed.
- The crossing is done “carelessly”.
If we had a single attention head, we might only focus on a single word, either “chicken”, “road”, or “carelessly”. Multiple heads let us attend to several words at once. They also provide redundancy: if any single head fails, the other attention heads can compensate.
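The mechanics can be sketched as follows: split the model dimension into one subspace per head, run attention independently in each, and concatenate the results. This is a simplification; real implementations use learned per-head projection matrices rather than slicing, and apply a final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads):
    """Split d_model into n_heads subspaces, attend in each independently,
    then concatenate the per-head outputs."""
    seq_len, d_model = Q.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, sl], K[:, sl], V[:, sl]
        # Each head computes its own attention weights, so different
        # heads are free to focus on different words.
        weights = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1)  # (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))  # 6 words, d_model = 8
out = multi_head_attention(x, x, x, n_heads=2)
print(out.shape)  # (6, 8)
```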
Why have multiple attention layers?
Multiple attention layers build in redundancy (on top of having multiple attention heads). If we had only a single attention layer, it would have to do a flawless job; this design is brittle and can lead to suboptimal outcomes. Multiple attention layers address this: each layer uses the output of the previous layer, with the safety net of skip connections. Thus, if any single attention layer fails, the skip connections and downstream layers can mitigate the issue.
Stacking attention layers also broadens the model’s receptive field. The first attention layer produces context vectors by attending to interactions between pairs of words in the input sentence. Then, the second layer produces context vectors based on pairs of pairs, and so on. With more attention layers, the Transformer gains a wider perspective and can attend to multiple interaction levels within the input sentence.
Why have skip connections?
Because attention acts as a filter, it blocks most information from passing through. As a result, a small change to the inputs of the attention layer may not change the outputs, if the attention score is tiny or zero. This can lead to flat gradients or local optima.
Skip connections help dampen the impact of poor attention filtering. Even if an input’s attention weight is zero and the input is blocked, skip connections add a copy of that input to the output. This ensures that even small changes to the input can still have noticeable impact on the output. Furthermore, skip connections preserve the input sentence: There’s no guarantee that a context word will attend to itself in a transformer. Skip connections ensure this by taking the context word vector and adding it to the output.
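A skip connection is just an addition of the sub-layer's input to its output. The sketch below makes this concrete with a deliberately pathological attention layer that blocks everything (outputs zeros); this is simplified and omits the layer normalization that real Transformer blocks also apply.

```python
import numpy as np

def attention_block_with_skip(x, attn_fn):
    """Residual (skip) connection around an attention sub-layer: the input
    is added back to the sub-layer's output, so even a word whose attention
    weight is ~0 still reaches the next layer unchanged."""
    return x + attn_fn(x)

# A worst-case attention layer that filters out everything:
blocked = lambda x: np.zeros_like(x)

x = np.arange(6.0).reshape(2, 3)
out = attention_block_with_skip(x, blocked)
print(np.allclose(out, x))  # True: the skip connection preserves the input
```

Even though the attention sub-layer contributed nothing, the input flows through intact, which is exactly why gradients stay healthy and the input sentence is preserved.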
• • •
Finally, here’s what Andrej Karpathy had to say about Transformers (and Attention).
The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously: 1) expressive (in the forward pass), 2) optimizable (via backpropagation + gradient descent), 3) efficient (high parallelism compute graph).
— Andrej Karpathy (@karpathy) October 19, 2022
References
- Attention Is All You Need
- The Illustrated Transformer
- Transformers From Scratch
- Transformers From Scratch (yeap, same title, not an error)
- Understanding the Attention Mechanism in Sequence Models
OG image prompt: “words in a sentence, emphasis on the words, with a pair of glasses, in the style of contrasting tones, artifacts of online culture, innovative page design, complexity theory, bold black and whites, bold color scheme –ar 2:1 –version 5.1”
If you found this useful, please cite this write-up as:
Yan, Ziyou. (May 2023). Some Intuition on Attention and the Transformer. eugeneyan.com. https://eugeneyan.com/writing/attention/.
or
@article{yan2023attention,
  title   = {Some Intuition on Attention and the Transformer},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2023},
  month   = {May},
  url     = {https://eugeneyan.com/writing/attention/}
}