Decoder

Figure: Transformer decoder architecture (dl_model_transformer_decoder.png)
Source: "Attention Is All You Need" (Vaswani et al., 2017)

The Decoder block in a Transformer model is responsible for generating the output sequence. It operates similarly to the Encoder but with some additional mechanisms. The Decoder consists of a stack of identical layers, each with three sub-layers: a masked multi-head self-attention mechanism, a multi-head attention over the output of the Encoder, and a position-wise fully connected feed-forward network.

  1. Masked Multi-Head Self-Attention Mechanism (Attention > Multi-Head Attention): This mechanism allows the model to focus on different positions in the input when generating each position in the output. However, unlike in the Encoder, it is "masked" to prevent positions from attending to subsequent positions, ensuring that predictions for position i can depend only on known outputs at positions less than i.

  2. Multi-Head Attention Over Encoder Output: This sub-layer performs multi-head attention over the output of the Encoder stack. The queries come from the previous decoder layer, and keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.

  3. Position-wise Fully Connected Feed-Forward Network: This is identical to that in the Encoder block.

Each sub-layer has a residual connection around it followed by layer normalization.
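The following is a minimal sketch of one such decoder layer in PyTorch, using the post-norm layout described above. The hyperparameters (d_model, n_heads, d_ff) and class name are illustrative choices, not values taken from the text.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: masked multi-head self-attention
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Sub-layer 2: multi-head attention over the encoder output
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Sub-layer 3: position-wise feed-forward network
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # x: (batch, tgt_len, d_model), enc_out: (batch, src_len, d_model)
        # 1. Masked self-attention: queries, keys and values all come from x
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder
        attn_out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + attn_out)           # residual connection + layer norm
        # 3. Feed-forward network applied to every position independently
        x = self.norm3(x + self.ffn(x))        # residual connection + layer norm
        return x
```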

During decoding, the block generates one element of the output sequence per time step, conditioned on both the encoder's representation of the input and all previously generated elements of the sequence. This makes the Transformer an auto-regressive model that produces its output one token at a time.
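The loop below illustrates this auto-regressive generation schematically. It assumes a hypothetical encoder-decoder `model` with `encode`/`decode` methods and special BOS/EOS token ids; it is meant to show the conditioning pattern, not a specific library API.

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)             # encoder output, computed once
    out_ids = [bos_id]                         # the sequence generated so far
    for _ in range(max_len):
        tgt = torch.tensor([out_ids])          # condition on all previous outputs
        logits = model.decode(tgt, memory)     # (1, len(out_ids), vocab_size)
        next_id = int(logits[0, -1].argmax())  # greedy choice of the next token
        out_ids.append(next_id)
        if next_id == eos_id:                  # stop once the model emits EOS
            break
    return out_ids
```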

Decoder-Only Models

Decoder-only models, such as the Generative Pre-trained Transformer (GPT), are a variant of the Transformer that uses only the decoder stack. These models are designed for tasks that require generating text, such as language modeling and text completion.

In GPT, the input is processed left to right and each token is predicted from its preceding tokens only. This differs from the Transformer encoder, which processes all tokens in parallel and attends to both preceding and following tokens when building the representation of a given token.

The masked self-attention mechanism in GPT ensures that a given token can only attend to earlier positions in the input sequence, maintaining the auto-regressive property. This makes it suitable for tasks like language modeling, where future tokens are not available at prediction time.
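A small illustration of this causal ("look-ahead") mask: position i may only attend to positions up to and including i, so everything strictly above the diagonal is blocked. The shape shown here is an arbitrary example.

```python
import torch

seq_len = 5
# True marks positions that must NOT be attended to (the strict upper triangle).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
```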

The advantage of decoder-only models like GPT is their ability to generate coherent and contextually relevant sentences by effectively capturing long-range dependencies between words. However, they may not be as effective for tasks that require understanding the interdependencies between all words in a sentence or paragraph, as they only consider preceding context.

Decoder Blocks in GPT

The decoder block in GPT is similar to the standard Transformer decoder block, but with some modifications. It consists of a stack of identical layers, each with two sub-layers:

  1. Masked Multi-Head Self-Attention Mechanism (Attention > Multi-Head Attention in GPT): This mechanism allows the model to focus on different positions in the input when generating each position in the output. However, it is "masked" to prevent positions from attending to subsequent positions, ensuring that predictions for position i can depend only on known outputs at positions less than i.

  2. Position-wise Fully Connected Feed-Forward Network: This is identical to the feed-forward network in the standard Transformer encoder and decoder blocks.

Each sub-layer has a residual connection around it followed by layer normalization.

Unlike the standard Transformer decoder, GPT does not have a second attention mechanism over the encoder output because there is no separate Encoder phase. Instead, all inputs are processed by the decoder in an auto-regressive manner.
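A minimal sketch of such a GPT-style block, under the same assumptions as the decoder-layer sketch above: the cross-attention sub-layer is dropped and the causal mask is built inside the block. (Later variants such as GPT-2 apply layer normalization before each sub-layer; the post-norm layout described here is kept for consistency.)

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        # Sub-layer 1: masked multi-head self-attention (no cross-attention)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Sub-layer 2: position-wise feed-forward network
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: each position may attend only to itself and earlier positions.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))   # residual connection + layer norm
        return x
```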

At each step, GPT uses its current input and all previously generated outputs to generate its next output. The masked self-attention mechanism ensures that a given token can only attend to previous positions in the input sequence, maintaining the auto-regressive property.
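In schematic form, generation with a decoder-only model looks like the greedy loop shown earlier, but without any encoder memory to condition on. Here `gpt_model` is a hypothetical stack of blocks like the one above (plus embeddings and an output projection) that maps a (1, seq_len) tensor of token ids to (1, seq_len, vocab_size) logits.

```python
import torch

def generate(gpt_model, prompt_ids, max_new_tokens=20):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                # the prompt plus everything generated so far
        logits = gpt_model(x)                  # causal masking happens inside the model
        next_id = int(logits[0, -1].argmax())  # greedy choice of the next token
        ids.append(next_id)
    return ids
```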