Encoder

dl_model_transformer_encoder.png
Source: Paper - Attention Is All You Need

The Encoder Block is a fundamental component of the Transformer model, which is widely used in natural language processing tasks such as machine translation, text summarization, and sentiment analysis. The Transformer follows an Encoder-Decoder architecture, in which the encoder is responsible for turning the input sequence into a contextual representation that the rest of the model can work with.

The Encoder Block consists of two main parts: a self-attention mechanism (Attention > Multi-Head Attention) and a feed-forward neural network. Each of these sub-layers is wrapped in a residual connection followed by layer normalization, which stabilizes training of deep stacks.
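
A minimal sketch of this layer structure in PyTorch is shown below. It is illustrative only: the default sizes (d_model=512, n_heads=8, d_ff=2048) are borrowed from the paper's base configuration, and PyTorch's built-in multi-head attention stands in for the paper's own implementation.

    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        """One encoder block: self-attention and a feed-forward network, each
        wrapped in a residual connection followed by layer normalization."""

        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                                   dropout=dropout, batch_first=True)
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):                          # x: (batch, seq_len, d_model)
            attn_out, _ = self.self_attn(x, x, x)      # queries, keys, values all come from x
            x = self.norm1(x + self.dropout(attn_out)) # residual connection + layer norm
            ff_out = self.feed_forward(x)
            x = self.norm2(x + self.dropout(ff_out))   # residual connection + layer norm
            return x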

  1. Self-Attention Mechanism: This part of the encoder block helps the model understand the context of each word in a sentence by assigning attention scores to all words. The scores express how much focus should be placed on the other words when encoding a particular word, so each word's representation becomes a weighted combination of the whole sentence (a minimal sketch of this computation follows the list).

  2. Feed-forward Neural Network: After the self-attention mechanism, the output passes through this fully connected network (in the original paper, two linear layers with a ReLU in between). The same feed-forward network is applied to each position separately and identically, so all positions can be processed in parallel.
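
The attention scores mentioned in point 1 come from scaled dot-product attention. A single-head toy version, with made-up tensor sizes and weight matrices purely for illustration, could look like this:

    import math
    import torch

    def self_attention(x, W_q, W_k, W_v):
        """Single-head scaled dot-product self-attention: each position's output
        is a weighted average over all positions, weighted by attention scores."""
        Q, K, V = x @ W_q, x @ W_k, x @ W_v                        # project input to queries/keys/values
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (seq_len, seq_len) attention scores
        weights = torch.softmax(scores, dim=-1)                    # how much each word attends to the others
        return weights @ V                                         # context-aware representation per position

    # Toy usage: a "sentence" of 5 tokens, model dimension 16
    x = torch.randn(5, 16)
    W_q, W_k, W_v = torch.randn(16, 16), torch.randn(16, 16), torch.randn(16, 16)
    out = self_attention(x, W_q, W_k, W_v)                         # shape: (5, 16)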

The input data to an Encoder Block is first embedded into vectors. These vectors are then passed through a stack of several Encoder Blocks, where each layer captures a different level of abstraction and so helps the model represent increasingly complex patterns in the data.

All encoder blocks have the same structure but do not share weights with each other. The output of one block serves as the input to the next, creating multiple levels of encoding that allow increasingly complex relationships between words to be captured.
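
Reusing the EncoderLayer sketch from above, the whole stack could be assembled as follows. Positional encodings are omitted to keep the sketch short, and the layer count and sizes are again the paper's base values, not requirements:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Stack of structurally identical encoder blocks; each block is a
        separate module instance, so weights are not shared between blocks."""

        def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)  # token ids -> vectors
            self.layers = nn.ModuleList(
                [EncoderLayer(d_model, n_heads) for _ in range(n_layers)])

        def forward(self, token_ids):          # token_ids: (batch, seq_len)
            x = self.embed(token_ids)
            for layer in self.layers:          # the output of one block feeds the next
                x = layer(x)
            return x                           # (batch, seq_len, d_model)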

Encoder Only Models

Some Transformer models, such as BERT (Bidirectional Encoder Representations from Transformers), use only the encoder part of the architecture. These models are often used for tasks like text classification, sentiment analysis, and named entity recognition.

In these cases, the encoder blocks process the input data and generate a representation that captures the context of each word in the input. This representation is then passed to a final output layer for prediction.
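
As an illustration only (not BERT's actual code), such a final output layer could be a linear classifier on top of the encoder stack from the previous sketch; taking the first token's representation mimics BERT's use of the [CLS] token:

    import torch
    import torch.nn as nn

    class EncoderClassifier(nn.Module):
        """Hypothetical encoder-only classifier: encode the sequence, take the
        first token's representation, and project it onto class logits."""

        def __init__(self, vocab_size, num_classes, d_model=512, n_layers=6):
            super().__init__()
            self.encoder = Encoder(vocab_size, d_model=d_model, n_layers=n_layers)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, token_ids):              # (batch, seq_len)
            hidden = self.encoder(token_ids)       # contextual representation of every token
            cls_repr = hidden[:, 0, :]             # representation of the first ([CLS]-style) token
            return self.classifier(cls_repr)       # (batch, num_classes) logits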

BERT uses a stack of encoder blocks to capture the context on both the left and the right of a word in a sentence (bidirectional understanding). It is pre-trained on two tasks: masked language modeling and next sentence prediction.
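
As a rough sketch of the masked language modeling objective (simplified: real BERT also keeps or randomly replaces some of the selected tokens instead of always masking them), the input preparation could look like this:

    import torch

    def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
        """Simplified masked-language-model input preparation: hide a random
        subset of tokens; the hidden originals become the prediction targets."""
        inputs = token_ids.clone()
        labels = token_ids.clone()
        mask = torch.rand(token_ids.shape) < mask_prob   # pick ~15% of the positions
        inputs[mask] = mask_token_id                     # replace them with the [MASK] id
        labels[~mask] = -100                             # ignore unmasked positions in the loss
        return inputs, labels                            # (-100 is cross-entropy's default ignore_index)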