The three kinds of Attention possible in a model:
- Encoder-Decoder Attention: Attention between the input sequence and the output sequence.
- Self-attention in the input sequence: each word attends to all the words in the input sequence.
- Self-attention in the output sequence: one thing we should be wary of here is that the scope of self-attention is limited to the words that occur before a given word. This prevents any information leaks during the training of the model, and is done by masking the words that occur after the given word at each step. So for step 1, only the first word of the output sequence is NOT masked; for step 2, the first two words are NOT masked, and so on (see the sketch after this list).
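Here is a minimal NumPy sketch (not from the original text) of how such a look-ahead mask could be applied to raw attention scores before the softmax; the function names and the toy scores are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def causal_mask(seq_len):
    """Look-ahead mask: position i may only attend to positions <= i."""
    # True marks the entries (j > i) that must be hidden.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention_weights(scores):
    """Mask out future positions in raw attention scores, then softmax over keys."""
    mask = causal_mask(scores.shape[-1])
    scores = np.where(mask, -1e9, scores)                 # masked positions ~ -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 4 output-sequence positions with random scores.
rng = np.random.default_rng(0)
weights = masked_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))
# Row 1 puts all weight on word 1; row 2 spreads weight over words 1-2, and so on.
```

Because the masked scores become effectively negative infinity, the softmax assigns them (near) zero weight, so each output position only draws information from the words before it.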