A primer on the Transformer architecture.
The seminal paper “Attention Is All You Need” by Vaswani et al. quietly revolutionised the world of AI when it appeared in 2017. Little did its authors know they were laying the foundations for the creation of today’s Large Language Models, along with AI’s entrance into the mainstream. But before all that chaos, there was just this elegant idea about attention mechanisms that genuinely changed everything.
The work presented in the paper was initially conceived to find a more efficient and performant solution to the problem of sequence-to-sequence modeling, a machine learning task that involves converting an input sequence into an output sequence, potentially of different lengths. The experiments from the paper presented results for English-to-German and English-to-French translation tasks. The state-of-the-art solutions at the time relied mainly on complex building blocks such as Recurrent Neural Networks and Convolutional Neural Networks, along with attention mechanisms first introduced in the 2014 paper “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau, Cho, and Bengio). The main hypothesis, as the title of the paper states, is that “Attention is all you need,” and by introducing a novel architecture called the Transformer, which exclusively leverages highly parallelizable and efficient attention mechanisms, it is possible to achieve state-of-the-art results on machine translation tasks without using recurrence or convolution.
Before trying to understand the inner workings of the Transformer architecture, you should be familiar with a couple of preprocessing operations needed in order to feed text to a neural network. Here we will introduce some terms that you will find all over the literature, so if you are a beginner in the field, take your time to digest this section.
Let's say we have an input sentence such as "Manuscripts don't burn." This text has to be converted into an input sequence composed of multiple tokens; this procedure is called tokenization. A token can be a word or a sub-word, depending on the tokenization technique (Byte-Pair Encoding in this case, though we won't discuss it further in this blog post).
"Manuscripts don't burn." → ["Man", "uscript", "s", " don", "'t", " burn"]
Here the original text was subdivided into 6 tokens. We have a fixed number of possible tokens, which represents the cardinality of our dictionary $\mathscr{D}$ (e.g., in the paper a shared vocabulary of 37000 tokens was used for the English-to-German translation experiments, and a shared vocabulary of 32000 tokens for the English-to-French experiments). Once we have our tokens, we map each token to a randomly initialised vector $x$ of dimension $d_{\text{model}}$; we call these vectors embeddings.
$$x_{(\text{Man})},\; x_{(\text{uscript})},\; x_{(\text{s})},\; x_{(\text{ don})},\; x_{(\text{'t})},\; x_{(\text{ burn})} \in \mathbb{R}^{d_{\text{model}}}$$
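To make this concrete, here is a minimal NumPy sketch of the embedding lookup. The token ids, the vocabulary size, and the embedding table are made up for illustration; in the real model the table is a learned parameter and the ids come from the BPE tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512      # embedding dimension used in the paper
vocab_size = 1000  # demo-sized vocabulary (the paper uses e.g. 37000 for EN-DE)

# Embedding table: one randomly initialised row per token id (learned during training).
embedding_table = rng.normal(size=(vocab_size, d_model))

# Hypothetical token ids produced by a tokenizer for "Manuscripts don't burn."
tokens = ["Man", "uscript", "s", " don", "'t", " burn"]
token_ids = [101, 534, 52, 764, 209, 311]   # made-up ids, for illustration only

# Each token id selects its embedding vector x in R^{d_model}.
X = embedding_table[token_ids]   # shape: (6, d_model)
print(X.shape)                   # (6, 512)
```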
Given that the Transformer architecture aims to process all tokens in parallel, and these randomly initialized embeddings contain no information about the position of the tokens in the sentence (we would like the model to be able to distinguish between "Manuscripts don't burn" and "don't burn Manuscripts"), we add a positional encoding vector to each embedding vector to obtain the final input embeddings.
$$x_1 = x_{(\text{Man})} + p_1, \quad x_2 = x_{(\text{uscript})} + p_2, \quad x_3 = x_{(\text{s})} + p_3, \quad x_4 = x_{(\text{ don})} + p_4, \quad x_5 = x_{(\text{'t})} + p_5, \quad x_6 = x_{(\text{ burn})} + p_6$$
We obtain these positional encoding vectors in the following way:
$$p_i = f(i), \qquad f: \mathbb{N} \to \mathbb{R}^{d_{\text{model}}}, \qquad f(i)_k = \begin{cases} \sin(w_k \cdot i) & \text{if } k \text{ is even} \\ \cos(w_k \cdot i) & \text{if } k \text{ is odd} \end{cases}, \qquad w_k = \frac{1}{10000^{2k/d_{\text{model}}}}$$
Just for completeness:
$$p_i = \begin{bmatrix} \sin(w_0 \cdot i) \\ \cos(w_0 \cdot i) \\ \sin(w_1 \cdot i) \\ \cos(w_1 \cdot i) \\ \vdots \\ \sin(w_{d_{\text{model}}/2} \cdot i) \\ \cos(w_{d_{\text{model}}/2} \cdot i) \end{bmatrix}$$
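Here is a small NumPy sketch of these sinusoidal positional encodings, following the formula above (the choice of 1024 positions and $d_{\text{model}} = 256$ simply mirrors the visualization below):

```python
import numpy as np

def positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Return an (n_positions, d_model) matrix whose row i is the vector p_i."""
    P = np.zeros((n_positions, d_model))
    positions = np.arange(n_positions)[:, None]   # i
    k = np.arange(0, d_model, 2)[None, :]         # even feature indices 0, 2, 4, ...
    w = 1.0 / (10000 ** (k / d_model))            # w_m = 1 / 10000^(2m/d_model), m = k/2
    P[:, 0::2] = np.sin(positions * w)            # even dimensions -> sin
    P[:, 1::2] = np.cos(positions * w)            # odd dimensions  -> cos
    return P

P = positional_encoding(1024, 256)
print(P.shape)   # (1024, 256)
```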
Here is a visualization of what these positional encodings look like for the first 1024 positions, with $d_{\text{model}} = 256$:
After reading this section you may already have many questions about why positional encodings are designed this way, but answering them would require a separate blog post, so I'm asking you to proceed with this article without questioning their validity.
For simplicity, throughout this blog post, we’ll assume that each word in a sentence corresponds to a different token. Keep in mind that in practice, tokenization is more complex.
In the following sections, all vectors are represented as lowercase bold and are considered to be column vectors.
Let’s develop an intuition about the most important building block of the entire architecture: the scaled dot-product attention mechanism. This component allows each word embedding to pay attention to surrounding words (its context) and update itself to better represent its true semantic meaning.
Consider these two example sentences:
“I want to wear my cool jacket”
"Today the weather was quite cool"
Before training the model, the two word embeddings for the word cool would be identical, even though they carry quite distinct meanings. As humans we understand that cool in the first sentence means "fashionable", since it is juxtaposed with the word "jacket", while cool in the second sentence means "chilly", since it refers to "weather". So the way we formalize this attention mechanism should allow different words in the sentence to influence each other.
The “scaled dot-product attention” presented in the paper takes three input matrices:
A query matrix Q
A key matrix K
A value matrix V
These matrices have the following dimensions:
$$Q \in \mathbb{R}^{n_{\text{tokens}} \times d_k}, \qquad K \in \mathbb{R}^{n_{\text{tokens}} \times d_k}, \qquad V \in \mathbb{R}^{n_{\text{tokens}} \times d_v}$$
We'll define $d_k$ and $d_v$ more precisely later, but for now, understand that these are the dimensions of projections of the original embedding vectors, where $d_k, d_v < d_{\text{model}}$.
To grasp the role of these matrices, think of them as participating in an information exchange system: each token issues a query describing what it is looking for, exposes a key describing what it contains, and carries a value holding the information it can pass on to other tokens.
The attention mechanism works by computing how relevant each token $j$ is to each token $i$ by measuring the similarity between $q_i$ and $k_j$. When this similarity is high, token $i$ will incorporate more of token $j$'s value vector $v_j$ into its contextual representation.
In our example with “cool,” the query vector for “cool” in the first sentence might strongly match key vectors from fashion-related words like “jacket,” causing its representation to shift toward the “fashionable” meaning. Similarly, in the second sentence, “cool” might match strongly with weather-related words, shifting its representation toward the “chilly” meaning.
The values V are updated as follows:
$$V' = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Taking a closer look at the numerator of the softmax argument, we see that it is a kind of similarity matrix between the query vectors stored in $Q$ and the key vectors stored in $K$:
$$QK^T = \begin{bmatrix} q^T_{\text{Today}} \\ q^T_{\text{the}} \\ q^T_{\text{weather}} \\ q^T_{\text{was}} \\ q^T_{\text{quite}} \\ q^T_{\text{cool}} \end{bmatrix} \begin{bmatrix} k_{\text{Today}} & k_{\text{the}} & k_{\text{weather}} & k_{\text{was}} & k_{\text{quite}} & k_{\text{cool}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{Today}} \cdot k_{\text{Today}} & q_{\text{Today}} \cdot k_{\text{the}} & q_{\text{Today}} \cdot k_{\text{weather}} & q_{\text{Today}} \cdot k_{\text{was}} & q_{\text{Today}} \cdot k_{\text{quite}} & q_{\text{Today}} \cdot k_{\text{cool}} \\
q_{\text{the}} \cdot k_{\text{Today}} & q_{\text{the}} \cdot k_{\text{the}} & q_{\text{the}} \cdot k_{\text{weather}} & q_{\text{the}} \cdot k_{\text{was}} & q_{\text{the}} \cdot k_{\text{quite}} & q_{\text{the}} \cdot k_{\text{cool}} \\
q_{\text{weather}} \cdot k_{\text{Today}} & q_{\text{weather}} \cdot k_{\text{the}} & q_{\text{weather}} \cdot k_{\text{weather}} & q_{\text{weather}} \cdot k_{\text{was}} & q_{\text{weather}} \cdot k_{\text{quite}} & q_{\text{weather}} \cdot k_{\text{cool}} \\
q_{\text{was}} \cdot k_{\text{Today}} & q_{\text{was}} \cdot k_{\text{the}} & q_{\text{was}} \cdot k_{\text{weather}} & q_{\text{was}} \cdot k_{\text{was}} & q_{\text{was}} \cdot k_{\text{quite}} & q_{\text{was}} \cdot k_{\text{cool}} \\
q_{\text{quite}} \cdot k_{\text{Today}} & q_{\text{quite}} \cdot k_{\text{the}} & q_{\text{quite}} \cdot k_{\text{weather}} & q_{\text{quite}} \cdot k_{\text{was}} & q_{\text{quite}} \cdot k_{\text{quite}} & q_{\text{quite}} \cdot k_{\text{cool}} \\
q_{\text{cool}} \cdot k_{\text{Today}} & q_{\text{cool}} \cdot k_{\text{the}} & q_{\text{cool}} \cdot k_{\text{weather}} & q_{\text{cool}} \cdot k_{\text{was}} & q_{\text{cool}} \cdot k_{\text{quite}} & q_{\text{cool}} \cdot k_{\text{cool}}
\end{bmatrix}$$
After scaling the dot products by $\sqrt{d_k}$ and taking the softmax, we see that the new value vectors stored in $V'$ are the result of a weighted sum of the old ones.
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \begin{bmatrix}
0.40 & 0.10 & 0.10 & 0.20 & 0.10 & 0.10 \\
0.05 & 0.40 & 0.15 & 0.20 & 0.10 & 0.10 \\
0.15 & 0.20 & 0.35 & 0.15 & 0.05 & 0.10 \\
0.10 & 0.05 & 0.15 & 0.30 & 0.15 & 0.25 \\
0.15 & 0.10 & 0.10 & 0.10 & 0.45 & 0.10 \\
0.05 & 0.05 & 0.60 & 0.05 & 0.05 & 0.20
\end{bmatrix}
\begin{bmatrix} v^T_{\text{Today}} \\ v^T_{\text{the}} \\ v^T_{\text{weather}} \\ v^T_{\text{was}} \\ v^T_{\text{quite}} \\ v^T_{\text{cool}} \end{bmatrix}$$
And, as anticipated, the old value vector for the word "weather", $v_{\text{weather}}$, gets the chance to influence the new value vector for the word "cool", $v'_{\text{cool}}$:
$$\begin{bmatrix} v'^T_{\text{Today}} \\ v'^T_{\text{the}} \\ v'^T_{\text{weather}} \\ v'^T_{\text{was}} \\ v'^T_{\text{quite}} \\ v'^T_{\text{cool}} \end{bmatrix} = \begin{bmatrix}
0.40\,v^T_{\text{Today}} + 0.10\,v^T_{\text{the}} + 0.10\,v^T_{\text{weather}} + 0.20\,v^T_{\text{was}} + 0.10\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.05\,v^T_{\text{Today}} + 0.40\,v^T_{\text{the}} + 0.15\,v^T_{\text{weather}} + 0.20\,v^T_{\text{was}} + 0.10\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.15\,v^T_{\text{Today}} + 0.20\,v^T_{\text{the}} + 0.35\,v^T_{\text{weather}} + 0.15\,v^T_{\text{was}} + 0.05\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.10\,v^T_{\text{Today}} + 0.05\,v^T_{\text{the}} + 0.15\,v^T_{\text{weather}} + 0.30\,v^T_{\text{was}} + 0.15\,v^T_{\text{quite}} + 0.25\,v^T_{\text{cool}} \\
0.15\,v^T_{\text{Today}} + 0.10\,v^T_{\text{the}} + 0.10\,v^T_{\text{weather}} + 0.10\,v^T_{\text{was}} + 0.45\,v^T_{\text{quite}} + 0.10\,v^T_{\text{cool}} \\
0.05\,v^T_{\text{Today}} + 0.05\,v^T_{\text{the}} + \underline{0.60\,v^T_{\text{weather}}} + 0.05\,v^T_{\text{was}} + 0.05\,v^T_{\text{quite}} + 0.20\,v^T_{\text{cool}}
\end{bmatrix}$$
In case you are wondering about the rationale behind the denominator of the softmax argument, the authors claim the following:
"We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$. To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$."
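Putting the formula into code, here is a minimal NumPy sketch of scaled dot-product attention (no masking yet; the $Q$, $K$, $V$ matrices are random stand-ins for properly learned projections):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_tokens, d_k); V: (n_tokens, d_v). Returns the updated values V'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_tokens, n_tokens) similarity matrix
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # weighted sum of the value vectors

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 6, 64, 64
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (6, 64)
```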
Now that we understand how the scaled dot-product attention mechanism works, a natural question arises: how do we obtain the Q, K, and V matrices in the first place?
In the Transformer architecture, these matrices can be derived in different ways depending on whether we’re computing self-attention or cross-attention. For now, let’s focus on self-attention, as it’s conceptually simpler to grasp. We’ll introduce cross-attention later when discussing the Transformer’s encoder and decoder blocks.
In self-attention, all three matrices Q, K, and V are derived from the same input sequence (the input embeddings), which gets projected through learned linear transformations.
Note that the query and key spaces intentionally share the same dimensionality ($d_k$). This is because we compute dot products between queries and keys, which requires matching dimensions. The dimensionality of the values $d_v$ can be different, even though it's common to have $d_v = d_k$.
If we denote our input embeddings as matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, where $n_{\text{tokens}}$ is the sequence length and $d_{\text{model}}$ is the embedding dimension, then:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
With these projections, each token in our sequence now has corresponding query, key, and value vectors. These are then used in the scaled dot-product attention formula we covered earlier.
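In code, this amounts to three matrix multiplications; the projection matrices below are random placeholders for what would be learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 6, 512, 64, 64

X = rng.normal(size=(n_tokens, d_model))   # input embeddings (+ positional encodings)
W_Q = rng.normal(size=(d_model, d_k))      # learned in practice, random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)           # (6, 64) (6, 64) (6, 64)
```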
The authors of the paper take the scaled dot-product attention concept a step further with Multi-Head Attention, which allows the model to jointly attend to information from different representation subspaces.
In Multi-Head Attention, instead of performing a single attention function, the model performs attention multiple times in parallel. Each of these parallel attention operations is called a “head”, and each head learns to focus on different aspects of the relationships between tokens.
For each head i from a total of h heads, we create separate projection matrices:
$W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - Projects input embeddings into query space.
$W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - Projects input embeddings into key space.
$W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ - Projects input embeddings into value space.
Denoting again our input token embeddings as the matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, for each attention head $i$ we compute:
$$Q_i = XW^Q_i, \qquad K_i = XW^K_i, \qquad V_i = XW^V_i$$
Then, we apply the Scaled Dot Product attention to each head independently:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$$
The outputs from each head are concatenated and then projected once more using a final output projection matrix $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O$$
Multiple attention heads allow the model to capture different types of relationships simultaneously:
Some heads might focus on local relationships between adjacent words.
Others might capture long-distance dependencies.
Some might attend to syntactic relationships, while others focus on semantic meaning.
Generally speaking, this approach helps the model build richer representations of the input text.
In practice, the paper uses $h = 8$ parallel attention heads for the encoder and decoder layers, with $d_k = d_v = d_{\text{model}}/h = 64$.
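Below is a compact NumPy sketch of multi-head self-attention under these settings. Real implementations fuse the per-head projections into single large matrices for efficiency; the explicit loop over heads here is just meant to make the structure visible, and all weights are random placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of h per-head projection matrices; W_O: (h*d_v, d_model)."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O   # concat over the feature dim, then project

rng = np.random.default_rng(0)
n_tokens, d_model, h = 6, 512, 8
d_k = d_v = d_model // h                          # 64, as in the paper

X = rng.normal(size=(n_tokens, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))

print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)   # (6, 512)
```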
Layer Normalization is another crucial component of the Transformer architecture that helps stabilize and accelerate training. Unlike Batch Normalization, which normalizes across the batch dimension, LayerNorm operates on the feature dimension for each individual sample.
For an input vector $x \in \mathbb{R}^{d_{\text{model}}}$ (representing a single token's embedding), LayerNorm applies the following transformation:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Where:
$\mu = \frac{1}{d_{\text{model}}}\sum_{i=1}^{d_{\text{model}}} x_i$ is the mean of the features
$\sigma^2 = \frac{1}{d_{\text{model}}}\sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$ is the variance of the features
$\epsilon$ is a small constant added for numerical stability
$\gamma, \beta \in \mathbb{R}^{d_{\text{model}}}$ are learnable parameters (scale and shift)
⊙ denotes element-wise multiplication
When applied to a matrix of token embeddings $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, LayerNorm processes each row independently, normalizing across the embedding dimension.
The main purpose of Layer Normalization is to prevent exploding or vanishing gradients by ensuring that the input to each sub-layer of the full Transformer architecture has consistent statistical properties (which also allows for faster convergence!).
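A minimal sketch of LayerNorm applied row by row to a matrix of token embeddings; $\gamma$ and $\beta$ are shown at their identity initialization (ones and zeros) although they are learned during training, and $\epsilon = 10^{-6}$ is an assumed value.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each row (token embedding) over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 512
gamma, beta = np.ones(d_model), np.zeros(d_model)   # learnable in practice

X = np.random.default_rng(0).normal(size=(6, d_model))
out = layer_norm(X, gamma, beta)
print(out.mean(axis=-1)[:2], out.std(axis=-1)[:2])  # per-token means ~0, stds ~1
```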
The last building block that will help us compose the whole Transformer architecture is the Position-wise Feed Forward Network (FFN). Despite its simplicity, this component contributes significantly to the model’s expressive power.
The Feed Forward Network consists of two linear transformations with a ReLU activation in between:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Where:
$x \in \mathbb{R}^{d_{\text{model}}}$ is the input vector for a single token
$W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ are weight matrices
$b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{\text{model}}}$ are bias vectors
$d_{ff}$ is the inner dimension of the feed-forward network (typically 4 times larger than $d_{\text{model}}$)
Crucially, this FFN is applied to each position (token) separately and identically - hence "position-wise". When working with a sequence of tokens represented as a matrix $X \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}}$, the FFN processes each row independently, using the same set of parameters $W_1, b_1, W_2, b_2$.
The Position-wise FFN serves mainly to introduce non-linearity through the ReLU activation, increasing the model’s capacity to learn complex functions.
In the paper, the authors use $d_{ff} = 2048$ with $d_{\text{model}} = 512$, creating a significant expansion in the intermediate representation.
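A minimal sketch of the position-wise FFN; as before, the weights are random placeholders for learned parameters:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: the same weights are applied to every row (token) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # two linear maps with a ReLU in between

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 512, 2048, 6

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = rng.normal(size=(n_tokens, d_model))
print(feed_forward(X, W1, b1, W2, b2).shape)   # (6, 512): same shape in, same shape out
```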
Now that we have a grasp of how the Multi-Head Attention, Layer Normalization, and Feed Forward Network blocks work, we can visualize the full Transformer architecture, composed of both an encoder and a decoder block. In the context of neural machine translation, the encoder's responsibility is to extract information from the original language (e.g., English) sentence, while the decoder has to predict the words (tokens) that will compose the sentence in the target language (e.g., German). We'll use the following English-to-German sentence pair as an example:
"Manuscripts don't burn" → "Manuskripte brennen nicht"
Here is an illustration that depicts the flow of information from the encoder to the decoder during training:
The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers:
A multi-head self-attention mechanism, where Q, K, and V are all derived from the output of the previous layer (or input embeddings for the first layer)
A position-wise fully connected feed-forward network
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
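Putting the pieces together, here is a simplified sketch of one encoder layer and of the stack of $N = 6$ layers. For brevity it omits biases, the learnable LayerNorm parameters, dropout, and the padding mask, and it reuses the same parameters for every layer, which a real implementation would not do.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)          # gamma = 1, beta = 0 for brevity

def multi_head_self_attention(X, p):
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in p["heads"]]
    return np.concatenate(heads, axis=-1) @ p["W_O"]

def feed_forward(X, p):
    return np.maximum(0, X @ p["W1"]) @ p["W2"]

def encoder_layer(X, p):
    # Sub-layer 1: multi-head self-attention, residual connection, LayerNorm.
    X = layer_norm(X + multi_head_self_attention(X, p["mha"]))
    # Sub-layer 2: position-wise feed-forward network, residual connection, LayerNorm.
    X = layer_norm(X + feed_forward(X, p["ffn"]))
    return X

d_model, d_ff, h, n_tokens = 512, 2048, 8, 8
d_k = d_model // h
params = {
    "mha": {
        "heads": [(rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k)),
                   rng.normal(size=(d_model, d_k))) for _ in range(h)],
        "W_O": rng.normal(size=(h * d_k, d_model)),
    },
    "ffn": {"W1": rng.normal(size=(d_model, d_ff)), "W2": rng.normal(size=(d_ff, d_model))},
}

X = rng.normal(size=(n_tokens, d_model))   # embeddings + positional encodings
for _ in range(6):                         # the encoder stacks N = 6 such layers
    X = encoder_layer(X, params)           # (each layer would have its own parameters)
print(X.shape)                             # (8, 512)
```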
The encoder processes a sequence of tokens with a fixed maximum length, known as the encoder's context length. This parameter defines the upper limit of input tokens the model can handle in a single forward pass. In our toy example, we set this context length to $n_e = 8$, while the original Transformer paper implementation uses $n_e = 512$ (as seen in the official TensorFlow implementation). The context length should be chosen to accommodate the longest input sequences expected in your dataset. For sequences shorter than the context length, a special padding token "<PAD>" is added to fill the remaining positions. These padding tokens are typically masked out in the attention mechanism to prevent them from influencing the representations of actual content tokens.
How does this masking happen in practice? Coming back to our example we may have the following “similarity” matrix between the Query and Key vectors from the encoder:
$$QK^T = \begin{bmatrix} q^T_{\text{Manuscripts}} \\ q^T_{\text{don}} \\ q^T_{\text{'t}} \\ q^T_{\text{burn}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{Manuscripts}} & k_{\text{don}} & k_{\text{'t}} & k_{\text{burn}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{Manuscripts}} \cdot k_{\text{Manuscripts}} & q_{\text{Manuscripts}} \cdot k_{\text{don}} & q_{\text{Manuscripts}} \cdot k_{\text{'t}} & q_{\text{Manuscripts}} \cdot k_{\text{burn}} & q_{\text{Manuscripts}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{don}} \cdot k_{\text{Manuscripts}} & q_{\text{don}} \cdot k_{\text{don}} & q_{\text{don}} \cdot k_{\text{'t}} & q_{\text{don}} \cdot k_{\text{burn}} & q_{\text{don}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{'t}} \cdot k_{\text{Manuscripts}} & q_{\text{'t}} \cdot k_{\text{don}} & q_{\text{'t}} \cdot k_{\text{'t}} & q_{\text{'t}} \cdot k_{\text{burn}} & q_{\text{'t}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{burn}} \cdot k_{\text{Manuscripts}} & q_{\text{burn}} \cdot k_{\text{don}} & q_{\text{burn}} \cdot k_{\text{'t}} & q_{\text{burn}} \cdot k_{\text{burn}} & q_{\text{burn}} \cdot k_{\text{PAD}} & \cdots \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}$$
Since dot products involving any "<PAD>" token shouldn't contribute to the final value vectors, we add a padding mask $M_{\text{pad}}$ to the attention scores:
$$M_{\text{pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final Value update with padding mask applied will then look like this:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{pad}}\right)V$$

If you need a refresher on the softmax function to understand why the $-\infty$ terms map to 0 (and thus contribute nothing), here it is:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n_e} e^{z_j}}$$
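Here is a sketch of how such a padding mask could be built and applied. One small deviation from the matrix above: the mask below only blanks out the <PAD> key columns, not the <PAD> query rows, because a row made entirely of $-\infty$ would make the softmax ill-defined; the outputs at <PAD> positions are simply ignored downstream.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_e, d_k = 8, 64
tokens = ["Manuscripts", "don", "'t", "burn", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
is_pad = np.array([t == "<PAD>" for t in tokens])

# Additive mask: 0 where attention is allowed, -inf on the columns of <PAD> keys.
M_pad = np.where(is_pad[None, :], -np.inf, 0.0)   # shape (1, 8), broadcast over query rows
M_pad = np.broadcast_to(M_pad, (n_e, n_e))

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n_e, d_k)) for _ in range(3))

weights = softmax(Q @ K.T / np.sqrt(d_k) + M_pad)
print(np.round(weights[0], 2))   # the <PAD> columns receive exactly 0 attention weight
```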
And how does the encoder behave during inference? Exactly the same as in training.
The decoder is also composed of a stack of $N = 6$ identical layers. However, each decoder layer has three sub-layers instead of two:
A masked multi-head self-attention mechanism, where Q, K, and V all come from the decoder’s previous layer output (or input embeddings for the first layer). The masking ensures that predictions for a position can only depend on known outputs at earlier positions (read the sub-section below for more details)
A multi-head cross-attention mechanism, where Q comes from the output of the decoder’s first sub-layer, while K and V come from the encoder’s output. This allows the decoder to focus on relevant parts of the input sequence.
A position-wise fully connected feed-forward network analogous to the one in the encoder.
As in the encoder, we apply residual connections around each sub-layer followed by layer normalization.
The decoder takes as input the right-shifted target sequence (we add a "<SOS>" - Start Of Sentence - token at the beginning of the sequence). In our example, the decoder input sequence is ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"], where we've padded to match our context length $n_d = 6$.
Why right-shift the input? This right-shifting is crucial during training because of how the decoder learns to generate text. The decoder’s job is to predict the next token in the sequence based on what came before.
During training, we need to:
Give the decoder the tokens it should have predicted so far (to learn from)
Ask it to predict the next token in the sequence
Compare its prediction with the actual next token
Without right-shifting, the decoder would see the token it's trying to predict! By shifting the target sequence right (adding a start token at the beginning), each position in the decoder can only attend to previous tokens, preserving the causal nature of language generation. This means that at position 1, the decoder sees ["<SOS>"] and must predict "Manuskripte"; at position 2, the decoder sees ["<SOS>", "Manuskripte"] and must predict "brennen"; and so on.
For the decoder’s self-attention, we need to apply both:
A padding mask (similar to the encoder) to ignore padded tokens
A causal mask to prevent tokens from attending to future positions
Let’s first visualize the “similarity” matrix between Query and Key vectors in the decoder’s self-attention:
$$QK^T = \begin{bmatrix} q^T_{\text{SOS}} \\ q^T_{\text{Manuskripte}} \\ q^T_{\text{brennen}} \\ q^T_{\text{nicht}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{SOS}} & k_{\text{Manuskripte}} & k_{\text{brennen}} & k_{\text{nicht}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{SOS}} \cdot k_{\text{SOS}} & q_{\text{SOS}} \cdot k_{\text{Manuskripte}} & q_{\text{SOS}} \cdot k_{\text{brennen}} & q_{\text{SOS}} \cdot k_{\text{nicht}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} \\
q_{\text{Manuskripte}} \cdot k_{\text{SOS}} & q_{\text{Manuskripte}} \cdot k_{\text{Manuskripte}} & q_{\text{Manuskripte}} \cdot k_{\text{brennen}} & q_{\text{Manuskripte}} \cdot k_{\text{nicht}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} \\
q_{\text{brennen}} \cdot k_{\text{SOS}} & q_{\text{brennen}} \cdot k_{\text{Manuskripte}} & q_{\text{brennen}} \cdot k_{\text{brennen}} & q_{\text{brennen}} \cdot k_{\text{nicht}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} \\
q_{\text{nicht}} \cdot k_{\text{SOS}} & q_{\text{nicht}} \cdot k_{\text{Manuskripte}} & q_{\text{nicht}} \cdot k_{\text{brennen}} & q_{\text{nicht}} \cdot k_{\text{nicht}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{SOS}} & q_{\text{PAD}} \cdot k_{\text{Manuskripte}} & q_{\text{PAD}} \cdot k_{\text{brennen}} & q_{\text{PAD}} \cdot k_{\text{nicht}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{SOS}} & q_{\text{PAD}} \cdot k_{\text{Manuskripte}} & q_{\text{PAD}} \cdot k_{\text{brennen}} & q_{\text{PAD}} \cdot k_{\text{nicht}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}}
\end{bmatrix}$$
Now, we need to apply two masks:
First, the causal mask $M_{\text{causal}}$ ensures that tokens only attend to previous positions:
$$M_{\text{causal}} = \begin{bmatrix}
0 & -\infty & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & 0 & -\infty \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$
Second, the padding mask $M_{\text{pad}}$ prevents attention to and from padding tokens:
$$M_{\text{pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The combined mask $M_{\text{combined}} = M_{\text{causal}} + M_{\text{pad}}$ is then:
$$M_{\text{combined}} = \begin{bmatrix}
0 & -\infty & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final Value update with both masks applied will then look like:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{combined}}\right)V$$
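A sketch of how the causal and padding masks for this example could be constructed and combined (as before, padding is masked only along the key dimension so the <PAD> query rows stay well-defined):

```python
import numpy as np

n_d = 6
tokens = ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"]
is_pad = np.array([t == "<PAD>" for t in tokens])

# Causal mask: position i may only attend to positions j <= i.
M_causal = np.where(np.triu(np.ones((n_d, n_d), dtype=bool), k=1), -np.inf, 0.0)

# Padding mask on the key dimension: nobody may attend to <PAD> positions.
M_pad = np.where(is_pad[None, :], -np.inf, 0.0)

M_combined = M_causal + M_pad
print(M_combined)
# The row for "nicht" allows attention to <SOS>, "Manuskripte", "brennen", "nicht" only.
```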
The cross-attention mechanism allows the decoder to focus on relevant parts of the encoder’s output while generating each token. Unlike the decoder’s self-attention, cross-attention does not require causal masking because the entire source sequence is already available.
In our example, the decoder input ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"] needs to attend to the encoder output for ["Manuscripts", "don", "'t", "burn", "<PAD>", "<PAD>", "<PAD>", "<PAD>"].
In cross-attention, the Queries come from the decoder, while the Keys and Values come from the encoder. This enables each decoder position to attend to all encoder positions:
$$QK^T = \begin{bmatrix} q^T_{\text{SOS}} \\ q^T_{\text{Manuskripte}} \\ q^T_{\text{brennen}} \\ q^T_{\text{nicht}} \\ q^T_{\text{PAD}} \\ q^T_{\text{PAD}} \end{bmatrix} \begin{bmatrix} k_{\text{Manuscripts}} & k_{\text{don}} & k_{\text{'t}} & k_{\text{burn}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} & k_{\text{PAD}} \end{bmatrix}$$

$$QK^T = \begin{bmatrix}
q_{\text{SOS}} \cdot k_{\text{Manuscripts}} & q_{\text{SOS}} \cdot k_{\text{don}} & q_{\text{SOS}} \cdot k_{\text{'t}} & q_{\text{SOS}} \cdot k_{\text{burn}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} & q_{\text{SOS}} \cdot k_{\text{PAD}} \\
q_{\text{Manuskripte}} \cdot k_{\text{Manuscripts}} & q_{\text{Manuskripte}} \cdot k_{\text{don}} & q_{\text{Manuskripte}} \cdot k_{\text{'t}} & q_{\text{Manuskripte}} \cdot k_{\text{burn}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} & q_{\text{Manuskripte}} \cdot k_{\text{PAD}} \\
q_{\text{brennen}} \cdot k_{\text{Manuscripts}} & q_{\text{brennen}} \cdot k_{\text{don}} & q_{\text{brennen}} \cdot k_{\text{'t}} & q_{\text{brennen}} \cdot k_{\text{burn}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} & q_{\text{brennen}} \cdot k_{\text{PAD}} \\
q_{\text{nicht}} \cdot k_{\text{Manuscripts}} & q_{\text{nicht}} \cdot k_{\text{don}} & q_{\text{nicht}} \cdot k_{\text{'t}} & q_{\text{nicht}} \cdot k_{\text{burn}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} & q_{\text{nicht}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} \\
q_{\text{PAD}} \cdot k_{\text{Manuscripts}} & q_{\text{PAD}} \cdot k_{\text{don}} & q_{\text{PAD}} \cdot k_{\text{'t}} & q_{\text{PAD}} \cdot k_{\text{burn}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}} & q_{\text{PAD}} \cdot k_{\text{PAD}}
\end{bmatrix}$$
Notice that the resulting attention matrix has dimensions 6×8, reflecting the decoder sequence length (6) and the encoder sequence length (8).
In cross-attention, we only need to apply padding masks to ignore padding tokens in both sequences:
$$M_{\text{enc-pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
$$M_{\text{dec-pad}} = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The combined mask $M_{\text{cross}} = M_{\text{enc-pad}} + M_{\text{dec-pad}}$ is then:
$$M_{\text{cross}} = \begin{bmatrix}
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
0 & 0 & 0 & 0 & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty \\
-\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty & -\infty
\end{bmatrix}$$
The final value update in cross-attention with the padding masks applied will then look like:
$$V' = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M_{\text{cross}}\right)V$$
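A sketch of the cross-attention shapes for our example: queries come from the 6 decoder positions, keys and values from the 8 encoder positions, and the padding mask hides the encoder's <PAD> columns (the decoder's <PAD> rows are again left to be ignored downstream):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_d, n_e, d_k, d_v = 6, 8, 64, 64
enc_is_pad = np.array([False, False, False, False, True, True, True, True])

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_d, d_k))   # from the decoder's first sub-layer output
K = rng.normal(size=(n_e, d_k))   # from the encoder output
V = rng.normal(size=(n_e, d_v))   # from the encoder output

M_enc_pad = np.where(enc_is_pad[None, :], -np.inf, 0.0)   # (1, 8), broadcast over decoder rows

weights = softmax(Q @ K.T / np.sqrt(d_k) + M_enc_pad)     # (6, 8) attention matrix
V_prime = weights @ V                                     # (6, 64) updated decoder representations
print(weights.shape, V_prime.shape)
```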
This cross-attention mechanism allows the decoder to focus on relevant parts of the encoder's output at each decoding step.
Unlike in self-attention, there is no causal mask in cross-attention because we want each decoder position to have access to the entire source sequence.
The output of the Transformer decoder is a matrix $O \in \mathbb{R}^{n_{\text{tokens}} \times |\mathscr{D}|}$, where $n_{\text{tokens}}$ is the number of tokens in the target sequence and $|\mathscr{D}|$ is the size of the vocabulary (dictionary). Each row $i$ of this matrix contains the probability distribution, over all tokens in the vocabulary, for the token at position $i$ given the tokens that precede it.
More formally, for each position i in the target sequence:
$$O_{i,j} = P(\text{token}_i = \mathscr{D}_j \mid \text{token}_1, \text{token}_2, \dots, \text{token}_{i-1})$$
Where $\mathscr{D}_j$ represents the $j$-th token in the vocabulary $\mathscr{D}$.
Since these values represent a probability distribution, they must satisfy:
$$\sum_{j=1}^{|\mathscr{D}|} O_{i,j} = 1 \quad \forall i \in \{1, 2, \dots, n_{\text{tokens}}\}$$
This probability distribution is generated by applying a linear transformation to the decoder’s final representation, followed by a softmax function. During training, these probabilities are compared to the actual next tokens in the target sequence using cross-entropy loss, which encourages the model to assign high probabilities to the correct next tokens.
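A sketch of this final step: project the decoder output to vocabulary logits, turn each row into a probability distribution with a softmax, and compute the cross-entropy loss against the target ids. The vocabulary size, the projection matrix, and the target ids are all made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, vocab_size = 6, 512, 37000

decoder_output = rng.normal(size=(n_tokens, d_model))   # final decoder representations
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02   # output projection (random stand-in)

O = softmax(decoder_output @ W_out)                     # (6, 37000); each row sums to 1
print(O.sum(axis=-1))                                   # -> [1. 1. 1. 1. 1. 1.]

# Hypothetical target ids for "Manuskripte brennen nicht <EOS>" followed by padding.
targets = np.array([1042, 2377, 514, 3, 0, 0])
loss_per_position = -np.log(O[np.arange(n_tokens), targets])
# Positions whose target is <PAD> are excluded from the loss in practice.
loss = loss_per_position[:4].mean()
print(loss)
```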
It’s important to note that during inference (as opposed to training), we don’t have access to the entire target sequence in advance. Instead, we generate tokens one by one, using each generated token as input to predict the next one in an autoregressive manner.
Here’s how the decoder input evolves during inference for our translation example:
Step 1: We start with just the start-of-sequence token and padding:
Input: ["<SOS>", "<PAD>", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
The model (hopefully) predicts “Manuskripte” as the most likely first token
Step 2: We append the predicted token to our sequence:
Input: ["<SOS>", "Manuskripte", "<PAD>", "<PAD>", "<PAD>", "<PAD>"]
The model predicts “brennen” as the most likely second token
Step 3: We append the newly predicted token:
Input: ["<SOS>", "Manuskripte", "brennen", "<PAD>", "<PAD>", "<PAD>"]
The model predicts “nicht” as the most likely third token
Step 4: We append the newly predicted token:
Input: ["<SOS>", "Manuskripte", "brennen", "nicht", "<PAD>", "<PAD>"]
At this point, the model predicts the end-of-sequence token, signaling that the translation is complete: "Manuskripte brennen nicht" (Manuscripts don't burn).
This autoregressive process—where each prediction depends on all previous predictions—is fundamentally different from training, where we can use the ground truth sequence to teach the model all at once.
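Here is a sketch of this greedy autoregressive loop. The `transformer` function below is a random stand-in for the full encoder-decoder model described above, and the token ids are made up; it is only meant to show the shape of the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

SOS, EOS, PAD = 1, 2, 0        # hypothetical special token ids
vocab_size, n_d = 37000, 6

def transformer(src_ids, tgt_ids):
    """Stand-in for the full model: returns an (n_d, vocab_size) matrix of next-token
    probability distributions. A real model would run the encoder on src_ids and the
    decoder on tgt_ids as described above."""
    logits = rng.normal(size=(n_d, vocab_size))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(src_ids):
    tgt = [SOS]                                      # start with the <SOS> token
    while len(tgt) < n_d:
        tgt_ids = tgt + [PAD] * (n_d - len(tgt))     # pad up to the decoder context length
        O = transformer(src_ids, tgt_ids)
        next_id = int(O[len(tgt) - 1].argmax())      # most likely token at the current position
        if next_id == EOS:                           # stop once the model predicts <EOS>
            break
        tgt.append(next_id)
    return tgt[1:]                                   # generated ids, without <SOS>

src_ids = [1042, 2377, 514, 3011, PAD, PAD, PAD, PAD]   # made-up ids for the source sentence
print(greedy_decode(src_ids))
```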
On Masking During Inference: You might wonder if the causal mask is still necessary during inference since future positions contain only padding tokens, which would be masked by the padding mask anyway. You’re right that the padding mask alone would technically prevent information flow from future padding tokens. However, the causal mask is still maintained most of the time during inference for implementation consistency.
Throughout this blog post we've unpacked the elegant simplicity behind the Transformer architecture, demonstrating how self-attention mechanisms allow models to capture contextual relationships between tokens.
By understanding these fundamental building blocks you should now possess the conceptual framework necessary to navigate the rapidly evolving landscape of large language models!
For attribution, please cite this work as
Bonvini (2025, May 1). Last Week's Potatoes: Attention is all you need. Retrieved from https://lastweekspotatoes.com/posts/2025-03-01-attention-is-all-you-need/
BibTeX citation
@misc{bonvini2025attention,
  author = {Bonvini, Andrea},
  title = {Last Week's Potatoes: Attention is all you need},
  url = {https://lastweekspotatoes.com/posts/2025-03-01-attention-is-all-you-need/},
  year = {2025}
}