A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
The Transformer Architecture
The Transformer architecture follows an encoder-decoder structure, but does not rely on recurrence or convolutions in order to generate an output.
The encoder consists of a stack of N = 6 identical layers, where each layer is composed of two sublayers:
The first sublayer implements a multi-head self-attention mechanism. As we have seen, the multi-head mechanism implements h heads, each of which receives a different, linearly projected version of the queries, keys, and values; the h heads produce their outputs in parallel, and these outputs are then combined to generate a final result.
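The multi-head mechanism can be sketched in NumPy as follows. This is a minimal illustration, not a full implementation: the weight matrices are random placeholders, the projections for the h heads are modeled as slices of single query/key/value projections, and masking and dropout are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # Each head attends over its own slice of the projected
    # queries, keys, and values; the h outputs are computed in
    # parallel, concatenated, and passed through a final projection.
    d_model = X.shape[-1]
    d_head = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: a sequence of 5 positions with model dimension 8 and h = 2 heads
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 8)
```

Note that the output has the same shape as the input, which is what allows the residual connections described below to simply add the two together.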
The second sublayer is a fully connected feed-forward network, consisting of two linear transformations with Rectified Linear Unit (ReLU) activation in between:
FFN(x) = ReLU(xW1 + b1)W2 + b2
The six layers of the Transformer encoder apply the same linear transformations to all of the words in the input sequence, but each layer employs different weight (W1,W2) and bias (b1,b2) parameters to do so.
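The feed-forward sublayer amounts to two matrix multiplications with a ReLU in between, applied to every position independently. A minimal NumPy sketch, with illustrative dimensions (the inner dimension d_ff is an assumption for the example; in the original architecture it is larger than d_model):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2,
    # applied to each position in the sequence independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy dimensions: d_model = 8, inner (feed-forward) dimension d_ff = 32
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)
```

Because the same (W1, W2, b1, b2) are applied at every position, the sublayer can be seen as a position-wise transformation, while different layers of the stack use different parameters.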
Furthermore, each of these two sublayers has a residual connection around it.
Each sublayer is also succeeded by a normalization layer, layernorm(·), which normalizes the sum computed between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):
layernorm(x + sublayer(x))
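The residual connection and normalization can be sketched as follows. This is a simplified version of layer normalization: the learnable gain and bias parameters that a full implementation would include are omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (omitting the learnable gain and bias of a full implementation)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # layernorm(x + sublayer(x)): residual connection, then normalization
    return layer_norm(x + sublayer_output)

# Toy example: x is the sublayer input, sub stands in for sublayer(x)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
sub = rng.normal(size=(5, 8))
y = add_and_norm(x, sub)
print(y.shape)  # (5, 8)
```

After normalization, each position's vector has (approximately) zero mean along the feature dimension, which helps stabilize training of the deep stack.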
An important consideration to keep in mind is that the Transformer architecture cannot inherently capture any information about the relative positions of the words in the sequence, since it does not make use of recurrence. This information has to be injected by introducing positional encodings to the input embeddings.
The positional encoding vectors are of the same dimension as the input embeddings, and are generated using sine and cosine functions of different frequencies. They are then simply summed with the input embeddings in order to inject the positional information.
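The sinusoidal scheme above can be sketched as follows, where even dimensions use a sine and odd dimensions a cosine of the same frequency (d_model is assumed even here):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Injecting positional information: sum the encodings with the embeddings
rng = np.random.default_rng(0)
seq_len, d_model = 10, 8
embeddings = rng.normal(size=(seq_len, d_model))
pe = positional_encoding(seq_len, d_model)
inputs = embeddings + pe
print(pe.shape)  # (10, 8)
```

Because each frequency varies smoothly with position, nearby positions receive similar encodings, while the mix of frequencies keeps every position's vector distinct.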