Attention

Attention in the Transformer is a mechanism that helps the model focus on the parts of the input that are most relevant for making predictions. It lets the model weigh the importance of different elements in the input sequence, giving more "attention" to the more significant ones. This is particularly useful in tasks like machine translation, where understanding context is crucial. The attention mechanism improves the Transformer's ability to handle long-range dependencies in sequence data, making it more effective and efficient at processing sequential information.

Scaled Dot-Product Attention

Scaled Dot-Product Attention is the attention mechanism used in Transformers. It calculates the attention score by taking the dot product of the query and key, then scales it down by dividing by the square root of the dimension of the key vectors. This scaling is needed because for large key dimensions the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
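
A quick numerical sketch illustrates why the scaling matters. This uses NumPy with a hypothetical key dimension of 512; the point is only that unscaled scores saturate the softmax, while scaled scores keep it smooth:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512  # hypothetical key dimension

# Dot products of independent unit-variance vectors have variance ~ d_k,
# so unscaled scores are large and the softmax collapses toward one-hot.
q = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))

raw_scores = keys @ q
scaled_scores = raw_scores / np.sqrt(d_k)

print(softmax(raw_scores).round(3))     # typically close to one-hot: tiny gradients
print(softmax(scaled_scores).round(3))  # smoother distribution: useful gradients
```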

The formula for Scaled Dot-Product Attention is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

  • Q represents Query
  • K represents Key
  • V represents Value
  • d_k is the dimensionality of queries and keys

First, it computes the dot product between Query and Key to produce a score matrix, scales the scores by √d_k, and then applies a softmax function to normalize them so each row adds up to 1. The result is multiplied with Value to get a weighted sum, which determines how much focus to place on different parts of the input sequence.
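
The steps above map directly to a few lines of NumPy. The following is a minimal sketch (the function name and the toy shapes are just for illustration), reusing the stable softmax from the previous snippet:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dot products, scaled by sqrt(d_k)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights       # weighted sum of values, plus the weights

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 16))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 16) (4, 6)
```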

Multi-Head Attention

Multi-Head Attention is a mechanism used in Transformers that allows the model to focus on different positions of the input sequence simultaneously. It operates by applying the attention mechanism multiple times in parallel, with each application referred to as a "head".

The idea behind Multi-Head Attention is that each head will learn to focus on different aspects of the input data. For example, one head may focus on syntactic features while another may focus on semantic features. This makes it possible for the model to capture various types of information from the input data.

In practice, Multi-Head Attention is implemented by linearly projecting the queries, keys, and values h times with different learned linear projections to d_k, d_k, and d_v dimensions respectively.

The formula for Multi-Head Attention is:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where each head_i is computed as follows:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Here:

  • Q represents Query
  • K represents Key
  • V represents Value
  • W_i^Q, W_i^K, W_i^V, and W^O are learned weight matrices; the projection matrices are unique to each head

Afterwards, the h parallel output heads are concatenated and linearly transformed by W^O to produce the final values. This mechanism allows Transformer models to attend to information at different positions, from different representational subspaces, simultaneously.
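
Putting the pieces together, here is a minimal sketch of Multi-Head Attention in NumPy. The weight layout, dictionary keys, and sizes (d_model = 64, h = 4, d_k = d_v = 16) are hypothetical choices for illustration, not a prescribed implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, weights):
    """weights holds per-head projections (W_Q, W_K, W_V) and the output matrix W_O."""
    heads = []
    for W_Q, W_K, W_V in weights["heads"]:
        # Project into this head's subspace, then apply scaled dot-product attention.
        heads.append(attention(Q @ W_Q, K @ W_K, V @ W_V))
    concat = np.concatenate(heads, axis=-1)   # (seq, h * d_v)
    return concat @ weights["W_O"]            # final linear transform

# Hypothetical sizes: d_model = 64, h = 4, d_k = d_v = 16.
rng = np.random.default_rng(0)
d_model, h, d_k = 64, 4, 16
weights = {
    "heads": [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3)) for _ in range(h)],
    "W_O": rng.standard_normal((h * d_k, d_model)),
}
x = rng.standard_normal((10, d_model))        # a sequence of 10 token embeddings
out = multi_head_attention(x, x, x, weights)  # self-attention: Q = K = V = x
print(out.shape)  # (10, 64)
```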

Multi-Head Attention in GPT

GPT (Generative Pretrained Transformer) also uses multi-head attention, but with a modification known as "masked" self-attention or causal attention. In contrast to regular multi-head attention, which allows a position to attend to all positions before and after it, in GPT's masked self-attention a position is only allowed to attend to itself and earlier positions in the sequence.

This modification is crucial for tasks that need the model to generate outputs one token at a time, such as text generation. By ensuring that the prediction for position i can only depend on the known outputs at positions less than i, GPT can be used for generating sequences autoregressively.

In practice, this is implemented by modifying the self-attention sub-layer in the Transformer. When calculating the attention scores, any scores corresponding to future tokens are set to negative infinity before applying the softmax, effectively masking them out and preventing information flow from future tokens.
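
The masking step can be sketched with an upper-triangular boolean mask. Again this is a minimal NumPy illustration with hypothetical names and sizes, not GPT's actual code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked (causal) self-attention over a sequence of embeddings X: (seq, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])

    # Strictly upper-triangular mask: position i may not attend to positions j > i.
    seq_len = X.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)  # future scores -> -inf before softmax

    weights = softmax(scores)                 # masked entries become exactly 0
    return weights @ V

# Toy check with hypothetical sizes; the resulting weight matrix is lower-triangular.
rng = np.random.default_rng(0)
d_model, d_k = 16, 8
X = rng.standard_normal((5, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
print(causal_self_attention(X, W_Q, W_K, W_V).shape)  # (5, 8)
```

Because the masked scores become zero after the softmax, each output position is a weighted sum over the current and preceding tokens only, which is exactly the property autoregressive generation relies on.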

The introduction of this masked self-attention mechanism is one of the key factors that have made GPT and its successors so successful in natural language processing tasks. It enables the model to generate coherent and contextually relevant text by ensuring that each token generated is conditioned on the preceding tokens.