Transformer

dl_model_transformer.png
Source: Paper "Attention Is All You Need"

The Transformer is a deep learning model architecture used primarily for natural language processing (NLP) tasks. It was introduced by Vaswani et al. in the paper "Attention Is All You Need" and has since served as the backbone of many state-of-the-art models such as BERT, GPT-2, and T5.

The Transformer model is based on self-attention mechanisms and does not rely on recurrence like RNNs or convolution like CNNs. This makes it highly parallelizable and efficient to train.
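
As a rough illustration of the self-attention computation, here is a minimal sketch in PyTorch (an assumed library choice; the dimensions are arbitrary toy values, not the paper's configuration):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity of every query with every key
    weights = torch.softmax(scores, dim=-1)            # normalize into attention weights per query
    return weights @ v                                 # weighted sum of the values

# Self-attention on a toy sequence of 5 tokens with 16-dimensional vectors:
# queries, keys and values all come from the same input.
x = torch.randn(5, 16)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([5, 16])
```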

The architecture consists of an encoder-decoder structure:

  1. Encoder: The input sequence is passed through N identical layers, each made up of two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. There is a residual connection around each sub-layer, followed by layer normalization (see the code sketch after this list).

  2. Decoder: The decoder also has N identical layers but with an additional sub-layer that performs multi-head attention over the output of the encoder stack. Similar to the encoder, there are residual connections around each sub-layer followed by layer normalization.

  3. Positional Encoding: Since the model contains no recurrence or convolution, positional encodings are added to the input embeddings to give the model information about the position of each token in the sequence.

  4. Attention > Multi-Head Attention: This mechanism runs several attention operations in parallel, letting the model attend to different positions and representation subspaces simultaneously and capture richer context.

  5. Feed Forward Networks: A position-wise fully connected network (two linear layers with a non-linearity in between) applied independently to each position.

  6. Output Layer: The final layer is a linear layer followed by a softmax function that produces a probability distribution over the output vocabulary.
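
To make items 1 and 3-6 above concrete (the decoder follows the same pattern with an extra cross-attention sub-layer), here is a minimal PyTorch sketch, not the paper's exact configuration; dimensions like `d_model = 64` are toy assumptions. The paper itself stacks N = 6 such layers with d_model = 512 and 8 attention heads.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Sinusoidal positional encoding from the paper:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class EncoderLayer(nn.Module):
    # One encoder layer: multi-head self-attention and a position-wise
    # feed-forward network, each wrapped in a residual connection + LayerNorm.
    def __init__(self, d_model=64, nhead=4, dim_ff=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(),
                                nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)           # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)               # residual connection + layer norm
        x = self.norm2(x + self.ff(x))             # residual connection + layer norm
        return x

# Toy forward pass: embed 10 token ids, add positions, encode, project to the vocabulary.
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model)
encoder = EncoderLayer(d_model)
to_vocab = nn.Linear(d_model, vocab_size)          # final linear layer

tokens = torch.randint(0, vocab_size, (1, 10))     # (batch, seq_len)
logits = to_vocab(encoder(pos_enc(embed(tokens))))
probs = torch.softmax(logits, dim=-1)              # softmax over the vocabulary
print(probs.shape)  # torch.Size([1, 10, 1000])
```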

A Transformer model stacks multiple encoder and decoder layers. The encoder layers turn the input sequence into contextualized representations, while the decoder layers generate the output sequence one token at a time, conditioning on both the previously generated tokens and the encoder output.

While generating each output token, the decoder attends over the encoder output (cross-attention), so it can focus on the most relevant parts of the input sequence. This attention mechanism makes long-range dependencies much easier to capture than with a fixed-size context.
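
As a usage-level sketch, PyTorch's built-in `nn.Transformer` module implements this structure: the decoder input attends to its own (causally masked) prefix via self-attention and to the encoder output via cross-attention. The sizes below are hypothetical toy values:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=128, batch_first=True)

src = torch.randn(1, 10, 64)   # encoder input: 10 source positions
tgt = torch.randn(1, 7, 64)    # decoder input: 7 target positions

# Causal mask so each target position can only attend to earlier positions.
tgt_mask = model.generate_square_subsequent_mask(7)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 7, 64])
```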

The whole construct is highly parallelizable: unlike Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks, which must process a sequence one step at a time, the Transformer can process all positions of a sequence simultaneously, which makes training considerably faster.

The Transformer has been widely used in NLP tasks such as machine translation, text summarization, sentiment analysis, and more. It also forms the basis of models for generating human-like text, such as OpenAI's GPT-3.