A primer on the Transformer architecture.
The seminal paper “Attention Is All You Need” by Vaswani et al. quietly revolutionised the world of AI when it appeared in 2017. Little did its authors know they were laying the foundations for the creation of today’s Large Language Models, along with AI’s entrance into the mainstream. But before all that chaos, there was just this elegant idea about attention mechanisms that genuinely changed everything.
The work presented in the paper was initially conceived to find a more efficient and performant solution to the problem of sequence-to-sequence modeling, a machine learning task that involves converting an input sequence into an output sequence, potentially of different lengths. The experiments from the paper presented results for English-to-German and English-to-French translation tasks. The state-of-the-art solutions at the time relied mainly on complex building blocks such as Recurrent Neural Networks and Convolutional Neural Networks, along with attention mechanisms first introduced in the 2014 paper “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau, Cho, and Bengio). The main hypothesis, as the title of the paper states, is that “Attention is all you need,” and by introducing a novel architecture called the Transformer, which exclusively leverages highly parallelizable and efficient attention mechanisms, it is possible to achieve state-of-the-art results on machine translation tasks without using recurrence or convolution.
Before trying to understand the inner workings of the Transformer architecture, you should be familiar with a couple of preprocessing operations needed in order to feed text to a neural network. Here we will introduce some terms that you will find all over the literature, so if you are a beginner in the field, take your time to digest this section.
Let's say we have an input sentence such as "Manuscripts don't burn." This text has to be converted into an input sequence composed of multiple tokens; this procedure is called tokenization. A token can be a word or a sub-word, depending on the tokenization technique (Byte-Pair Encoding in this case, though we won't discuss it further in this blog post).
"Manuscripts don't burn." → ["Man", "uscript", "s", " don", "'t", " burn"]
Here the original text was subdivided into 6 tokens. We have a fixed number of possible tokens, which represents the cardinality of our dictionary $\mathscr{D}$ (e.g., in the paper a shared vocabulary of 37000 tokens was used for the English-to-German translation experiments, and a shared vocabulary of 32000 tokens for the English-to-French experiments). Once we have our tokens, we map each token to a randomly initialised vector $x$ of dimension $d_{\text{model}}$; we call these vectors embeddings.
$$x_{(\text{Man})},\; x_{(\text{uscript})},\; x_{(\text{s})},\; x_{(\text{ don})},\; x_{(\text{'t})},\; x_{(\text{ burn})} \in \mathbb{R}^{d_{\text{model}}}$$
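To make this concrete, here is a minimal NumPy sketch of the embedding lookup. The token ids, the vocabulary size, and the embedding table are made up for illustration; in the real model the table is a learned parameter and the ids come from the BPE tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512      # embedding dimension used in the paper
vocab_size = 1000  # demo-sized vocabulary (the paper uses e.g. 37000 for EN-DE)

# Embedding table: one randomly initialised row per token id (learned during training).
embedding_table = rng.normal(size=(vocab_size, d_model))

# Hypothetical token ids produced by a tokenizer for "Manuscripts don't burn."
tokens = ["Man", "uscript", "s", " don", "'t", " burn"]
token_ids = [101, 534, 52, 764, 209, 311]   # made-up ids, for illustration only

# Each token id selects its embedding vector x in R^{d_model}.
X = embedding_table[token_ids]   # shape: (6, d_model)
print(X.shape)                   # (6, 512)
```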
Given that the Transformer architecture aims to process all tokens in parallel, and these randomly initialized embeddings contain no information about the position of the tokens in the sentence (we would like the model to be able to distinguish between "Manuscripts don't burn" and "don't burn Manuscripts"), we add a positional encoding vector to each embedding vector to obtain the final input embeddings.
$$x_1 = x_{(\text{Man})} + p_1, \quad x_2 = x_{(\text{uscript})} + p_2, \quad x_3 = x_{(\text{s})} + p_3, \quad x_4 = x_{(\text{ don})} + p_4, \quad x_5 = x_{(\text{'t})} + p_5, \quad x_6 = x_{(\text{ burn})} + p_6$$
We obtain these positional encoding vectors in the following way:
$$p_i = f(i), \qquad f: \mathbb{N} \to \mathbb{R}^{d_{\text{model}}}, \qquad f(i)_k = \begin{cases} \sin(w_k \cdot i) & \text{if } k \text{ is even} \\ \cos(w_k \cdot i) & \text{if } k \text{ is odd} \end{cases}, \qquad w_k = \frac{1}{10000^{2k/d_{\text{model}}}}$$
Just for completeness:
$$p_i = \begin{bmatrix} \sin(w_0 \cdot i) \\ \cos(w_0 \cdot i) \\ \sin(w_1 \cdot i) \\ \cos(w_1 \cdot i) \\ \vdots \\ \sin(w_{d_{\text{model}}/2} \cdot i) \\ \cos(w_{d_{\text{model}}/2} \cdot i) \end{bmatrix}$$
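Here is a small NumPy sketch of these sinusoidal positional encodings, following the formula above (the choice of 1024 positions and $d_{\text{model}} = 256$ simply mirrors the visualization below):

```python
import numpy as np

def positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Return an (n_positions, d_model) matrix whose row i is the vector p_i."""
    P = np.zeros((n_positions, d_model))
    positions = np.arange(n_positions)[:, None]   # i
    k = np.arange(0, d_model, 2)[None, :]         # even feature indices 0, 2, 4, ...
    w = 1.0 / (10000 ** (k / d_model))            # w_m = 1 / 10000^(2m/d_model), m = k/2
    P[:, 0::2] = np.sin(positions * w)            # even dimensions -> sin
    P[:, 1::2] = np.cos(positions * w)            # odd dimensions  -> cos
    return P

P = positional_encoding(1024, 256)
print(P.shape)   # (1024, 256)
```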
Here is a visualization of what these positional encodings look like for the first 1024 positions, with $d_{\text{model}} = 256$:
After reading this section you may already have many questions about why positional encodings are designed this way, but answering them would require a separate blog post, so I'm asking you to proceed with this article without questioning their validity.
For simplicity, throughout this blog post, we’ll assume that each word in a sentence corresponds to a different token. Keep in mind that in practice, tokenization is more complex.
In the following sections, all vectors are represented as lowercase bold and are considered to be column vectors.
Let’s develop an intuition about the most important building block of the entire architecture: the scaled dot-product attention mechanism. This component allows each word embedding to pay attention to surrounding words (its context) and update itself to better represent its true semantic meaning.
Consider these two example sentences:
“I want to wear my cool jacket”
"Today the weather was quite cool"
Before training the model, the two word embeddings for the word cool would be identical, even though they carry quite distinct meanings. As humans we understand that cool in the first sentence means "fashionable", since it is juxtaposed with the word "jacket", while cool in the second sentence means "chilly", since it refers to "weather". So the way we formalize this attention mechanism should allow different words in the sentence to influence each other.
The “scaled dot-product attention” presented in the paper takes three input matrices:
A query matrix Q
A key matrix K
A value matrix V
These matrices have the following dimensions:
$$Q \in \mathbb{R}^{n_{\text{tokens}} \times d_k}, \qquad K \in \mathbb{R}^{n_{\text{tokens}} \times d_k}, \qquad V \in \mathbb{R}^{n_{\text{tokens}} \times d_v}$$
We'll define $d_k$ and $d_v$ more precisely later, but for now, understand that these are the dimensions of projections of the original embedding vectors, where $d_k, d_v < d_{\text{model}}$.
To grasp the role of these matrices, think of them as participating in an information exchange system: each token issues a query describing what it is looking for, exposes a key describing what it contains, and carries a value holding the information it can pass on to other tokens.
The attention mechanism works by computing how relevant each token $j$ is to each token $i$ by measuring the similarity between $q_i$ and $k_j$. When this similarity is high, token $i$ will incorporate more of token $j$'s value vector $v_j$ into its contextual representation.
In our example with “cool,” the query vector for “cool” in the first sentence might strongly match key vectors from fashion-related words like “jacket,” causing its representation to shift toward the “fashionable” meaning. Similarly, in the second sentence, “cool” might match strongly with weather-related words, shifting its representation toward the “chilly” meaning.
The values V are updated as follows:
$$V' = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Taking a closer look at the numerator of the softmax argument, we see that it is a kind of similarity matrix between the query vectors stored in $Q$ and the key vectors stored in $K$:
$$QK^T = \begin{bmatrix} q^T_{\text{Today}} \\ q^T_{\text{the}} \\ q^T_{\text{weather}} \\ q^T_{\text{was}} \\ q^T_{\text{quite}} \\ q^T_{\text{cool}} \end{bmatrix} \begin{bmatrix} k_{\text{Today}} & k_{\text{the}} & k_{\text{weather}} & k_{\text{was}} & k_{\text{quite}} & k_{\text{cool}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{Today}} \cdot k_{\text{Today}} & q_{\text{Today}} \cdot k_{\text{the}} & q_{\text{Today}} \cdot k_{\text{weather}} & q_{\text{Today}} \cdot k_{\text{was}} & q_{\text{Today}} \cdot k_{\text{quite}} & q_{\text{Today}} \cdot k_{\text{cool}} \\
q_{\text{the}} \cdot k_{\text{Today}} & q_{\text{the}} \cdot k_{\text{the}} & q_{\text{the}} \cdot k_{\text{weather}} & q_{\text{the}} \cdot k_{\text{was}} & q_{\text{the}} \cdot k_{\text{quite}} & q_{\text{the}} \cdot k_{\text{cool}} \\
q_{\text{weather}} \cdot k_{\text{Today}} & q_{\text{weather}} \cdot k_{\text{the}} & q_{\text{weather}} \cdot k_{\text{weather}} & q_{\text{weather}} \cdot k_{\text{was}} & q_{\text{weather}} \cdot k_{\text{quite}} & q_{\text{weather}} \cdot k_{\text{cool}} \\
q_{\text{was}} \cdot k_{\text{Today}} & q_{\text{was}} \cdot k_{\text{the}} & q_{\text{was}} \cdot k_{\text{weather}} & q_{\text{was}} \cdot k_{\text{was}} & q_{\text{was}} \cdot k_{\text{quite}} & q_{\text{was}} \cdot k_{\text{cool}} \\
q_{\text{quite}} \cdot k_{\text{Today}} & q_{\text{quite}} \cdot k_{\text{the}} & q_{\text{quite}} \cdot k_{\text{weather}} & q_{\text{quite}} \cdot k_{\text{was}} & q_{\text{quite}} \cdot k_{\text{quite}} & q_{\text{quite}} \cdot k_{\text{cool}} \\
q_{\text{cool}} \cdot k_{\text{Today}} & q_{\text{cool}} \cdot k_{\text{the}} & q_{\text{cool}} \cdot k_{\text{weather}} & q_{\text{cool}} \cdot k_{\text{was}} & q_{\text{cool}} \cdot k_{\text{quite}} & q_{\text{cool}} \cdot k_{\text{cool}}
\end{bmatrix}$$
After scaling the dot products by $\sqrt{d_k}$ and taking the softmax, we see that the new value vectors stored in $V'$ are the result of a weighted sum of the old ones.
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \begin{bmatrix}
0.40 & 0.10 & 0.10 & 0.20 & 0.10 & 0.10 \\
0.05 & 0.40 & 0.15 & 0.20 & 0.10 & 0.10 \\
0.15 & 0.20 & 0.35 & 0.15 & 0.05 & 0.10 \\
0.10 & 0.05 & 0.15 & 0.30 & 0.15 & 0.25 \\
0.15 & 0.10 & 0.10 & 0.10 & 0.45 & 0.10 \\
0.05 & 0.05 & 0.60 & 0.05 & 0.05 & 0.20
\end{bmatrix}
\begin{bmatrix} v^T_{\text{Today}} \\ v^T_{\text{the}} \\ v^T_{\text{weather}} \\ v^T_{\text{was}} \\ v^T_{\text{quite}} \\ v^T_{\text{cool}} \end{bmatrix}$$
And, as anticipated, the old value vector for the word "weather", $v_{\text{weather}}$, gets the chance to influence the new value vector for the word "cool", $v'_{\text{cool}}$:
$$\begin{bmatrix} v'^T_{\text{Today}} \\ v'^T_{\text{the}} \\ v'^T_{\text{weather}} \\ v'^T_{\text{was}} \\ v'^T_{\text{quite}} \\ v'^T_{\text{cool}} \end{bmatrix} = \begin{bmatrix}
0.40\,v^T_{\text{Today}} + 0.10\,v^T_{\text{the}} + 0.10\,v^T_{\text{weather}} + 0.20\,v^T_{\text{was}} + 0.10\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.05\,v^T_{\text{Today}} + 0.40\,v^T_{\text{the}} + 0.15\,v^T_{\text{weather}} + 0.20\,v^T_{\text{was}} + 0.10\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.15\,v^T_{\text{Today}} + 0.20\,v^T_{\text{the}} + 0.35\,v^T_{\text{weather}} + 0.15\,v^T_{\text{was}} + 0.05\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.10\,v^T_{\text{Today}} + 0.05\,v^T_{\text{the}} + 0.15\,v^T_{\text{weather}} + 0.30\,v^T_{\text{was}} + 0.15\,v^T_{\text{quite}} + 0.25\,v^T_{\text{cool}} \\
0.15\,v^T_{\text{Today}} + 0.10\,v^T_{\text{the}} + 0.10\,v^T_{\text{weather}} + 0.10\,v^T_{\text{was}} + 0.45\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.05\,v^T_{\text{Today}} + 0.05\,v^T_{\text{the}} + \underline{0.60\,v^T_{\text{weather}}} + 0.05\,v^T_{\text{was}} + 0.05\,v^T_{\text{quite}} + 0.20\,v^T_{\text{cool}}
\end{bmatrix}$$
In case you are wondering about the rationale behind the denominator of the softmax argument, the authors claim the following:
"We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$. To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$."
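Putting the formula into code, here is a minimal NumPy sketch of scaled dot-product attention (no masking yet; the $Q$, $K$, $V$ matrices are random stand-ins for properly learned projections):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_tokens, d_k); V: (n_tokens, d_v). Returns the updated values V'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_tokens, n_tokens) similarity matrix
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # weighted sum of the value vectors

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 6, 64, 64
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (6, 64)
```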
Now that we understand how the scaled dot-product attention mechanism works, a natural question arises: how do we obtain the Q, K, and V matrices in the first place?
In the Transformer architecture, these matrices can be derived in different ways depending on whether we’re computing self-attention or cross-attention. For now, let’s focus on self-attention, as it’s conceptually simpler to grasp. We’ll introduce cross-attention later when discussing the Transformer’s encoder and decoder blocks.
In self-attention, all three matrices Q, K, and V are derived from the same input sequence (the input embeddings), which gets projected through learned linear transformations.
Note that the query and key spaces intentionally share the same dimensionality ($d_k$). This is because we compute dot products between queries and keys, which requires matching dimensions. The dimensionality of the values $d_v$ can be different, even though it's common to have $d_v = d_k$.
If we denote our input embeddings as matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, where $n_{\text{tokens}}$ is the sequence length and $d_{\text{model}}$ is the embedding dimension, then:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
With these projections, each token in our sequence now has corresponding query, key, and value vectors. These are then used in the scaled dot-product attention formula we covered earlier.
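In code, this amounts to three matrix multiplications; the projection matrices below are random placeholders for what would be learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 6, 512, 64, 64

X = rng.normal(size=(n_tokens, d_model))   # input embeddings (+ positional encodings)
W_Q = rng.normal(size=(d_model, d_k))      # learned in practice, random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)           # (6, 64) (6, 64) (6, 64)
```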
The authors of the paper take the scaled dot-product attention concept a step further with Multi-Head Attention, which allows the model to jointly attend to information from different representation subspaces.
In Multi-Head Attention, instead of performing a single attention function, the model performs attention multiple times in parallel. Each of these parallel attention operations is called a “head”, and each head learns to focus on different aspects of the relationships between tokens.
For each head i from a total of h heads, we create separate projection matrices:
$W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - Projects input embeddings into query space.
$W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - Projects input embeddings into key space.
$W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ - Projects input embeddings into value space.
Denoting again our input token embeddings as the matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, for each attention head $i$ we compute:
$$Q_i = XW^Q_i, \qquad K_i = XW^K_i, \qquad V_i = XW^V_i$$
Then, we apply the Scaled Dot Product attention to each head independently:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$$
The outputs from each head are concatenated and then projected once more using a final output projection matrix $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O$$
Multiple attention heads allow the model to capture different types of relationships simultaneously:
Some heads might focus on local relationships between adjacent words.
Others might capture long-distance dependencies.
Some might attend to syntactic relationships, while others focus on semantic meaning.
Generally speaking, this approach helps the model build richer representations of the input text.
In practice, the paper uses $h = 8$ parallel attention heads for the encoder and decoder layers, with $d_k = d_v = d_{\text{model}}/h = 64$.
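Below is a compact NumPy sketch of multi-head self-attention under these settings. Real implementations fuse the per-head projections into single large matrices for efficiency; the explicit loop over heads here is just meant to make the structure visible, and all weights are random placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of h per-head projection matrices; W_O: (h*d_v, d_model)."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O   # concat over the feature dim, then project

rng = np.random.default_rng(0)
n_tokens, d_model, h = 6, 512, 8
d_k = d_v = d_model // h                          # 64, as in the paper

X = rng.normal(size=(n_tokens, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))

print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)   # (6, 512)
```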
Layer Normalization is another crucial component of the Transformer architecture that helps stabilize and accelerate training. Unlike Batch Normalization, which normalizes across the batch dimension, LayerNorm operates on the feature dimension for each individual sample.
For an input vector $x \in \mathbb{R}^{d_{\text{model}}}$ (representing a single token's embedding), LayerNorm applies the following transformation:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Where:
$\mu = \frac{1}{d_{\text{model}}}\sum_{i=1}^{d_{\text{model}}} x_i$ is the mean of the features
$\sigma^2 = \frac{1}{d_{\text{model}}}\sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$ is the variance of the features
$\epsilon$ is a small constant added for numerical stability
$\gamma, \beta \in \mathbb{R}^{d_{\text{model}}}$ are learnable parameters (scale and shift)
⊙ denotes element-wise multiplication
When applied to a matrix of token embeddings $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, LayerNorm processes each row independently, normalizing across the embedding dimension.
The main purpose of Layer Normalization is to prevent exploding or vanishing gradients by ensuring that the input to each sub-layer of the full Transformer architecture has consistent statistical properties (which also allows for faster convergence!).
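A minimal sketch of LayerNorm applied row by row to a matrix of token embeddings; $\gamma$ and $\beta$ are shown at their identity initialization (ones and zeros) although they are learned during training, and $\epsilon = 10^{-6}$ is an assumed value.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each row (token embedding) over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 512
gamma, beta = np.ones(d_model), np.zeros(d_model)   # learnable in practice

X = np.random.default_rng(0).normal(size=(6, d_model))
out = layer_norm(X, gamma, beta)
print(out.mean(axis=-1)[:2], out.std(axis=-1)[:2])  # per-token means ~0, stds ~1
```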
The last building block that will help us compose the whole Transformer architecture is the Position-wise Feed Forward Network (FFN). Despite its simplicity, this component contributes significantly to the model’s expressive power.
The Feed Forward Network consists of two linear transformations with a ReLU activation in between:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Where:
$x \in \mathbb{R}^{d_{\text{model}}}$ is the input vector for a single token
$W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ are weight matrices
$b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{\text{model}}}$ are bias vectors
$d_{ff}$ is the inner dimension of the feed-forward network (typically 4 times larger than $d_{\text{model}}$)
Crucially, this FFN is applied to each position (token) separately and identically - hence "position-wise". When working with a sequence of tokens represented as a matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, the FFN processes each row independently, using the same set of parameters $W_1, b_1, W_2, b_2$.
The Position-wise FFN serves mainly to introduce non-linearity through the ReLU activation, increasing the model’s capacity to learn complex functions.
In the paper, the authors use $d_{ff} = 2048$ with $d_{\text{model}} = 512$, creating a significant expansion in the intermediate representation.
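A minimal sketch of the position-wise FFN; as before, the weights are random placeholders for learned parameters:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: the same weights are applied to every row (token) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # two linear maps with a ReLU in between

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 512, 2048, 6

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = rng.normal(size=(n_tokens, d_model))
print(feed_forward(X, W1, b1, W2, b2).shape)   # (6, 512): same shape in, same shape out
```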
Now that we have a grasp of how the Multi-Head Attention, Layer Normalization, and Feed Forward Network blocks work, we can visualize the full Transformer architecture, composed of both an encoder and a decoder block. In the context of neural machine translation, the encoder's responsibility is to extract information from the original language (e.g., English) sentence, while the decoder has to predict the words (tokens) that will compose the sentence in the target language (e.g., German). We'll use the following English-to-German sentence pair as an example:
"Manuscripts don't burn" → "Manuskripte brennen nicht"
Here is an illustration that depicts the flow of information from the encoder to the decoder during training:
The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers:
A multi-head self-attention mechanism, where Q, K, and V are all derived from the output of the previous layer (or input embeddings for the first layer)
A position-wise fully connected feed-forward network
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
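Putting the pieces together, here is a simplified sketch of one encoder layer and of the stack of $N = 6$ layers. For brevity it omits biases, the learnable LayerNorm parameters, dropout, and the padding mask, and it reuses the same parameters for every layer, which a real implementation would not do.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)          # gamma = 1, beta = 0 for brevity

def multi_head_self_attention(X, p):
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in p["heads"]]
    return np.concatenate(heads, axis=-1) @ p["W_O"]

def feed_forward(X, p):
    return np.maximum(0, X @ p["W1"]) @ p["W2"]

def encoder_layer(X, p):
    # Sub-layer 1: multi-head self-attention, residual connection, LayerNorm.
    X = layer_norm(X + multi_head_self_attention(X, p["mha"]))
    # Sub-layer 2: position-wise feed-forward network, residual connection, LayerNorm.
    X = layer_norm(X + feed_forward(X, p["ffn"]))
    return X

d_model, d_ff, h, n_tokens = 512, 2048, 8, 8
d_k = d_model // h
params = {
    "mha": {
        "heads": [(rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k))) for _ in range(h)],
        "W_O": rng.normal(size=(h * d_k, d_model)),
    },
    "ffn": {"W1": rng.normal(size=(d_model, d_ff)), "W2": rng.normal(size=(d_ff, d_model))},
}

X = rng.normal(size=(n_tokens, d_model))   # embeddings + positional encodings
for _ in range(6):                         # the encoder stacks N = 6 such layers
    X = encoder_layer(X, params)           # (each layer would have its own parameters)
print(X.shape)                             # (8, 512)
```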
The encoder processes a sequence of tokens with a fixed maximum length, known as the encoder's context length. This parameter defines the upper limit of input tokens the model can handle in a single forward pass. In our toy example, we set this context length to $n_e = 8$, while the original Transformer paper implementation uses $n_e = 512$ (as seen in the official TensorFlow implementation). The context length should be chosen to accommodate the longest input sequences expected in your dataset. For sequences shorter than the context length, a special padding token "<PAD>" is added to fill the remaining positions. These padding tokens are typically masked out in the attention mechanism to prevent them from influencing the representations of actual content tokens.
How does this masking happen in practice? Coming back to our example we may have the following “similarity” matrix between the Query and Key vectors from the encoder:
$$QK^T = \begin{bmatrix} q^T_{\text{Manuscripts}} \\ q^T_{\text{don}} \\ q^T_{\text{'t}} \\ q^T_{\text{burn}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{Manuscripts}} & k_{\text{don}} & k_{\text{'t}} & k_{\text{burn}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{Manuscripts}} \cdot k_{\text{Manuscripts}} & q_{\text{Manuscripts}} \cdot k_{\text{don}} & q_{\text{Manuscripts}} \cdot k_{\text{'t}} & q_{\text{Manuscripts}} \cdot k_{\text{burn}} & q_{\text{Manuscripts}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{don}} \cdot k_{\text{Manuscripts}} & q_{\text{don}} \cdot k_{\text{don}} & q_{\text{don}} \cdot k_{\text{'t}} & q_{\text{don}} \cdot k_{\text{burn}} & q_{\text{don}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{'t}} \cdot k_{\text{Manuscripts}} & q_{\text{'t}} \cdot k_{\text{don}} & q_{\text{'t}} \cdot k_{\text{'t}} & q_{\text{'t}} \cdot k_{\text{burn}} & q_{\text{'t}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{burn}} \cdot k_{\text{Manuscripts}} & q_{\text{burn}} \cdot k_{\text{don}} & q_{\text{burn}} \cdot k_{\text{'t}} & q_{\text{burn}} \cdot k_{\text{burn}} & q_{\text{burn}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}$$
Since dot products involving any "<PAD>" token shouldn't contribute to the final value vectors, we add a padding mask $M_{\text{pad}}$ to the attention scores:
$$M_{\text{pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final Value update with padding mask applied will then look like this:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{pad}}\right)V$$

If you need a refresher on the softmax function to understand why the $-\infty$ terms map to 0 (and thus contribute nothing), here it is:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n_e} e^{z_j}}$$
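Here is a sketch of how such a padding mask could be built and applied. One small deviation from the matrix above: the mask below only blanks out the <PAD> key columns, not the <PAD> query rows, because a row made entirely of $-\infty$ would make the softmax ill-defined; the outputs at <PAD> positions are simply ignored downstream.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_e, d_k = 8, 64
tokens = ["Manuscripts", "don", "'t", "burn", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
is_pad = np.array([t == "<PAD>" for t in tokens])

# Additive mask: 0 where attention is allowed, -inf on the columns of <PAD> keys.
M_pad = np.where(is_pad[None, :], -np.inf, 0.0)   # shape (1, 8), broadcast over query rows
M_pad = np.broadcast_to(M_pad, (n_e, n_e))

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n_e, d_k)) for _ in range(3))

weights = softmax(Q @ K.T / np.sqrt(d_k) + M_pad)
print(np.round(weights[0], 2))   # the <PAD> columns receive exactly 0 attention weight
```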
And how does the encoder behave during inference? Exactly the same as in training.
The decoder is also composed of a stack of $N = 6$ identical layers. However, each decoder layer has three sub-layers instead of two:
A masked multi-head self-attention mechanism, where Q, K, and V all come from the decoder’s previous layer output (or input embeddings for the first layer). The masking ensures that predictions for a position can only depend on known outputs at earlier positions (read the sub-section below for more details)
A multi-head cross-attention mechanism, where Q comes from the output of the decoder’s first sub-layer, while K and V come from the encoder’s output. This allows the decoder to focus on relevant parts of the input sequence.
A position-wise fully connected feed-forward network analogous to the one in the encoder.
As in the encoder, we apply residual connections around each sub-layer followed by layer normalization.
The decoder takes as input the right-shifted target sequence (we add a "<SOS>" - Start Of Sentence - token at the beginning of the sequence). In our example, the decoder input sequence is ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"], where we've padded to match our context length $n_d = 6$.
Why right-shift the input? This right-shifting is crucial during training because of how the decoder learns to generate text. The decoder’s job is to predict the next token in the sequence based on what came before.
During training, we need to:
Give the decoder the tokens it should have predicted so far (to learn from)
Ask it to predict the next token in the sequence
Compare its prediction with the actual next token
Without right-shifting, the decoder would see the token it's trying to predict! By shifting the target sequence right (adding a start token at the beginning), each position in the decoder can only attend to previous tokens, preserving the causal nature of language generation. This means that at position 1, the decoder sees ["<SOS>"] and must predict "Manuskripte"; at position 2, the decoder sees ["<SOS>", "Manuskripte"] and must predict "brennen"; and so on.
For the decoder’s self-attention, we need to apply both:
A padding mask (similar to the encoder) to ignore padded tokens
A causal mask to prevent tokens from attending to future positions
Let’s first visualize the “similarity” matrix between Query and Key vectors in the decoder’s self-attention:
$$QK^T = \begin{bmatrix} q^T_{\text{SOS}} \\ q^T_{\text{Manuskripte}} \\ q^T_{\text{brennen}} \\ q^T_{\text{nicht}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{SOS}} & k_{\text{Manuskripte}} & k_{\text{brennen}} & k_{\text{nicht}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{SOS}} \cdot k_{\text{SOS}} & q_{\text{SOS}} \cdot k_{\text{Manuskripte}} & q_{\text{SOS}} \cdot k_{\text{brennen}} & q_{\text{SOS}} \cdot k_{\text{nicht}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} \\
q_{\text{Manuskripte}} \cdot k_{\text{SOS}} & q_{\text{Manuskripte}} \cdot k_{\text{Manuskripte}} & q_{\text{Manuskripte}} \cdot k_{\text{brennen}} & q_{\text{Manuskripte}} \cdot k_{\text{nicht}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} \\
q_{\text{brennen}} \cdot k_{\text{SOS}} & q_{\text{brennen}} \cdot k_{\text{Manuskripte}} & q_{\text{brennen}} \cdot k_{\text{brennen}} & q_{\text{brennen}} \cdot k_{\text{nicht}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} \\
q_{\text{nicht}} \cdot k_{\text{SOS}} & q_{\text{nicht}} \cdot k_{\text{Manuskripte}} & q_{\text{nicht}} \cdot k_{\text{brennen}} & q_{\text{nicht}} \cdot k_{\text{nicht}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{SOS}} & q_{\text{PAD}} \cdot k_{\text{Manuskripte}} & q_{\text{PAD}} \cdot k_{\text{brennen}} & q_{\text{PAD}} \cdot k_{\text{nicht}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{SOS}} & q_{\text{PAD}} \cdot k_{\text{Manuskripte}} & q_{\text{PAD}} \cdot k_{\text{brennen}} & q_{\text{PAD}} \cdot k_{\text{nicht}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}}
\end{bmatrix}$$
Now, we need to apply two masks:
First, the causal mask $M_{\text{causal}}$ ensures that tokens only attend to previous positions:
$$M_{\text{causal}} = \begin{bmatrix}
0 & -\infty & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & 0 & -\infty \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$
Second, the padding mask $M_{\text{pad}}$ prevents attention to and from padding tokens:
$$M_{\text{pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The combined mask $M_{\text{combined}} = M_{\text{causal}} + M_{\text{pad}}$ is then:
$$M_{\text{combined}} = \begin{bmatrix}
0 & -\infty & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final Value update with both masks applied will then look like:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{combined}}\right)V$$
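A sketch of how the causal and padding masks for this example could be constructed and combined (as before, padding is masked only along the key dimension so the <PAD> query rows stay well-defined):

```python
import numpy as np

n_d = 6
tokens = ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"]
is_pad = np.array([t == "<PAD>" for t in tokens])

# Causal mask: position i may only attend to positions j <= i.
M_causal = np.where(np.triu(np.ones((n_d, n_d), dtype=bool), k=1), -np.inf, 0.0)

# Padding mask on the key dimension: nobody may attend to <PAD> positions.
M_pad = np.where(is_pad[None, :], -np.inf, 0.0)

M_combined = M_causal + M_pad
print(M_combined)
# The row for "nicht" allows attention to <SOS>, "Manuskripte", "brennen", "nicht" only.
```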
The cross-attention mechanism allows the decoder to focus on relevant parts of the encoder’s output while generating each token. Unlike the decoder’s self-attention, cross-attention does not require causal masking because the entire source sequence is already available.
In our example, the decoder input ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"] needs to attend to the encoder output for ["Manuscripts", "don", "'t", "burn", "<PAD>", "<PAD>", "<PAD>", "<PAD>"].
In cross-attention, the Queries come from the decoder, while the Keys and Values come from the encoder. This enables each decoder position to attend to all encoder positions:
$$QK^T = \begin{bmatrix} q^T_{\text{SOS}} \\ q^T_{\text{Manuskripte}} \\ q^T_{\text{brennen}} \\ q^T_{\text{nicht}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{Manuscripts}} & k_{\text{don}} & k_{\text{'t}} & k_{\text{burn}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{SOS}} \cdot k_{\text{Manuscripts}} & q_{\text{SOS}} \cdot k_{\text{don}} & q_{\text{SOS}} \cdot k_{\text{'t}} & q_{\text{SOS}} \cdot k_{\text{burn}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} \\
q_{\text{Manuskripte}} \cdot k_{\text{Manuscripts}} & q_{\text{Manuskripte}} \cdot k_{\text{don}} & q_{\text{Manuskripte}} \cdot k_{\text{'t}} & q_{\text{Manuskripte}} \cdot k_{\text{burn}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} \\
q_{\text{brennen}} \cdot k_{\text{Manuscripts}} & q_{\text{brennen}} \cdot k_{\text{don}} & q_{\text{brennen}} \cdot k_{\text{'t}} & q_{\text{brennen}} \cdot k_{\text{burn}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} \\
q_{\text{nicht}} \cdot k_{\text{Manuscripts}} & q_{\text{nicht}} \cdot k_{\text{don}} & q_{\text{nicht}} \cdot k_{\text{'t}} & q_{\text{nicht}} \cdot k_{\text{burn}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}}
\end{bmatrix}$$
Notice that the resulting attention matrix has dimensions 6×8, reflecting the decoder sequence length (6) and the encoder sequence length (8).
In cross-attention, we only need to apply padding masks to ignore padding tokens in both sequences:
$$M_{\text{enc-pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
$$M_{\text{dec-pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The combined mask $M_{\text{cross}} = M_{\text{enc-pad}} + M_{\text{dec-pad}}$ is then:
$$M_{\text{cross}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final value update in cross-attention with the padding masks applied will then look like:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{cross}}\right)V$$
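A sketch of the cross-attention shapes for our example: queries come from the 6 decoder positions, keys and values from the 8 encoder positions, and the padding mask hides the encoder's <PAD> columns (the decoder's <PAD> rows are again left to be ignored downstream):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_d, n_e, d_k, d_v = 6, 8, 64, 64
enc_is_pad = np.array([False, False, False, False, True, True, True, True])

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_d, d_k))   # from the decoder's first sub-layer output
K = rng.normal(size=(n_e, d_k))   # from the encoder output
V = rng.normal(size=(n_e, d_v))   # from the encoder output

M_enc_pad = np.where(enc_is_pad[None, :], -np.inf, 0.0)   # (1, 8), broadcast over decoder rows

weights = softmax(Q @ K.T / np.sqrt(d_k) + M_enc_pad)     # (6, 8) attention matrix
V_prime = weights @ V                                     # (6, 64) updated decoder representations
print(weights.shape, V_prime.shape)
```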
This cross-attention mechanism allows the decoder to focus on relevant parts of the encoder's output at each decoding step.
Unlike in self-attention, there is no causal mask in cross-attention because we want each decoder position to have access to the entire source sequence.
The output of the Transformer decoder is a matrix $O \in \mathbb{R}^{n_{\text{tokens}} \times |\mathscr{D}|}$, where $n_{\text{tokens}}$ is the number of tokens in the target sequence and $|\mathscr{D}|$ is the size of the vocabulary (dictionary). Each row $i$ of this matrix contains the probability distribution, over all tokens in the vocabulary, for the token at position $i$ given the tokens that precede it.
More formally, for each position i in the target sequence:
$$O_{i,j} = P(\text{token}_i = \mathscr{D}_j \mid \text{token}_1, \text{token}_2, \dots, \text{token}_{i-1})$$
Where $\mathscr{D}_j$ represents the $j$-th token in the vocabulary $\mathscr{D}$.
Since these values represent a probability distribution, they must satisfy:
$$\sum_{j=1}^{|\mathscr{D}|} O_{i,j} = 1 \quad \forall i \in \{1, 2, \dots, n_{\text{tokens}}\}$$
This probability distribution is generated by applying a linear transformation to the decoder’s final representation, followed by a softmax function. During training, these probabilities are compared to the actual next tokens in the target sequence using cross-entropy loss, which encourages the model to assign high probabilities to the correct next tokens.
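A sketch of this final step: project the decoder output to vocabulary logits, turn each row into a probability distribution with a softmax, and compute the cross-entropy loss against the target ids. The vocabulary size, the projection matrix, and the target ids are all made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, vocab_size = 6, 512, 37000

decoder_output = rng.normal(size=(n_tokens, d_model))   # final decoder representations
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02   # output projection (random stand-in)

O = softmax(decoder_output @ W_out)                     # (6, 37000); each row sums to 1
print(O.sum(axis=-1))                                   # -> [1. 1. 1. 1. 1. 1.]

# Hypothetical target ids for "Manuskripte brennen nicht <EOS>" followed by padding.
targets = np.array([1042, 2377, 514, 3, 0, 0])
loss_per_position = -np.log(O[np.arange(n_tokens), targets])
# Positions whose target is <PAD> are excluded from the loss in practice.
loss = loss_per_position[:4].mean()
print(loss)
```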
It’s important to note that during inference (as opposed to training), we don’t have access to the entire target sequence in advance. Instead, we generate tokens one by one, using each generated token as input to predict the next one in an autoregressive manner.
Here’s how the decoder input evolves during inference for our translation example:
Step 1: We start with just the start-of-sequence token and padding:
Input: ["<SOS>", "<PAD>", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
The model (hopefully) predicts “Manuskripte” as the most likely first token
Step 2: We append the predicted token to our sequence:
Input: ["<SOS>", "Manuskripte", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
The model predicts “brennen” as the most likely second token
Step 3: We append the newly predicted token:
Input: ["<SOS>", "Manuskripte", "brennen", "<PAD>", "<PAD>", "<PAD>"]
The model predicts “nicht” as the most likely third token
Step 4: We append the newly predicted token:
Input: ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"]
At this point, the model predicts the end-of-sequence token, signaling that the translation is complete: "Manuskripte brennen nicht" (Manuscripts don't burn).
This autoregressive process—where each prediction depends on all previous predictions—is fundamentally different from training, where we can use the ground truth sequence to teach the model all at once.
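Here is a sketch of this greedy autoregressive loop. The `transformer` function below is a random stand-in for the full encoder-decoder model described above, and the token ids are made up; it is only meant to show the shape of the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

SOS, EOS, PAD = 1, 2, 0        # hypothetical special token ids
vocab_size, n_d = 37000, 6

def transformer(src_ids, tgt_ids):
    """Stand-in for the full model: returns an (n_d, vocab_size) matrix of next-token
    probability distributions. A real model would run the encoder on src_ids and the
    decoder on tgt_ids as described above."""
    logits = rng.normal(size=(n_d, vocab_size))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(src_ids):
    tgt = [SOS]                                      # start with the <SOS> token
    while len(tgt) < n_d:
        tgt_ids = tgt + [PAD] * (n_d - len(tgt))     # pad up to the decoder context length
        O = transformer(src_ids, tgt_ids)
        next_id = int(O[len(tgt) - 1].argmax())      # most likely token at the current position
        if next_id == EOS:                           # stop once the model predicts <EOS>
            break
        tgt.append(next_id)
    return tgt[1:]                                   # generated ids, without <SOS>

src_ids = [1042, 2377, 514, 3011, PAD, PAD, PAD, PAD]   # made-up ids for the source sentence
print(greedy_decode(src_ids))
```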
On Masking During Inference: You might wonder if the causal mask is still necessary during inference since future positions contain only padding tokens, which would be masked by the padding mask anyway. You’re right that the padding mask alone would technically prevent information flow from future padding tokens. However, the causal mask is still maintained most of the time during inference for implementation consistency.
Throughout this blog post we've unpacked the elegant simplicity behind the Transformer architecture, demonstrating how self-attention mechanisms allow models to capture contextual relationships between tokens.
By understanding these fundamental building blocks you should now possess the conceptual framework necessary to navigate the rapidly evolving landscape of large language models!
For attribution, please cite this work as
Bonvini (2025, May 1). Last Week's Potatoes: Attention is all you need. Retrieved from https://lastweekspotatoes.com/posts/2025-03-01-attention-is-all-you-need/
BibTeX citation
@misc{bonvini2025attention,
  author = {Bonvini, Andrea},
  title = {Last Week's Potatoes: Attention is all you need},
  url = {https://lastweekspotatoes.com/posts/2025-03-01-attention-is-all-you-need/},
  year = {2025}
}