The attention mechanism allows models to focus on specific parts of the input sequence when producing an output. It assigns different importance to different parts of the input, enabling the model to "attend" to the most relevant parts of the sequence.
The attention mechanism works by using three key components: Query (Q), Key (K), and Value (V). These components are used to compute how much focus each part of the input should have in the output.
The attention mechanism relies on the following vectors:
The attention score between the query and key vectors is calculated as follows:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V \]
Where:
Here’s the key information used for this explanation:
Word Embedding for "I": [0.2, 0.5, 0.7] Word Embedding for "love": [0.4, 0.8, 0.6] Word Embedding for "statistics": [0.1, 0.6, 0.9]
Query Weight Matrix (Wq): [0.1, 0.3] [0.2, 0.4] [0.5, 0.6] Key Weight Matrix (Wk): [0.6, 0.2] [0.3, 0.5] [0.7, 0.9] Value Weight Matrix (Wv): [0.5, 0.8] [0.2, 0.3] [0.4, 0.6]
We compute the weighted key, value, and query vectors by multiplying each word's embedding with the respective weight matrices. The word "love" is the query, so the query calculation is only done for "love". For other words, we calculate the weighted key and value vectors.
Weighted Query for "love": (Embedding of "love" transposed) * Wq = [0.4, 0.8, 0.6] * Wq = [0.59, 0.78]
Similarly, we calculate the weighted key and value vectors for all words ("I", "love", "statistics"). Here are the results:
Weighted Key Vector for "I": [0.48, 0.54] Weighted Key Vector for "love": [0.68, 0.82] Weighted Key Vector for "statistics":[0.77, 1.12] Weighted Value Vector for "I": [0.66, 0.82] Weighted Value Vector for "love": [1.02, 1.10] Weighted Value Vector for "statistics":[0.66, 1.02]
We calculate the unnormalized attention scores by taking the dot product between the weighted query vector for "love" and the weighted key vectors for each word.
Attention("love", "I") = 0.5152 Attention("love", "love") = 0.7178 Attention("love", "statistics") = 0.8934
Now, we apply the softmax function to convert the unnormalized scores into probabilities:
Attention("love", "I") = 0.3012 Attention("love", "love") = 0.3675 Attention("love", "statistics") = 0.3313
Notice that the sum of the attention scores equals 1:
0.3012 + 0.3675 + 0.3313 = 1
Finally, we calculate the context vector by multiplying the normalized attention scores with the corresponding weighted value vectors for each word:
Context Vector for "I" = [0.3012 * 0.66, 0.3012 * 0.82] = [0.1988, 0.2460] Context Vector for "love" = [0.3675 * 1.02, 0.3675 * 1.10] = [0.3743, 0.4042] Context Vector for "statistics" = [0.3313 * 0.66, 0.3313 * 1.02] = [0.2184, 0.3377]