Attention Mechanism Explained

3. Attention Score Calculation

The attention score between the query and key vectors is calculated as follows:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V \]

Where:

Q * K^T: Dot product of the query and key vectors.
d_k: The dimension of the key vectors (usually the number of features in K).
The softmax function ensures the attention scores are converted into probabilities that sum to 1.

Key Information

Here’s the key information used for this explanation:

Word Embeddings (1x3):

      Word Embedding for "I": [0.2, 0.5, 0.7]
      Word Embedding for "love": [0.4, 0.8, 0.6]
      Word Embedding for "statistics": [0.1, 0.6, 0.9]

Weight Matrices (3x2):

      Query Weight Matrix (Wq):
      [0.1, 0.3]
      [0.2, 0.4]
      [0.5, 0.6]

      Key Weight Matrix (Wk):
      [0.6, 0.2]
      [0.3, 0.5]
      [0.7, 0.9]

      Value Weight Matrix (Wv):
      [0.5, 0.8]
      [0.2, 0.3]
      [0.4, 0.6]

Step 1: Compute Weighted Key, Value, and Query Vectors

We compute the weighted key, value, and query vectors by multiplying each word's embedding with the respective weight matrices. The word "love" is the query, so the query calculation is only done for "love". For other words, we calculate the weighted key and value vectors.

Query Vector for "love"

      Weighted Query for "love": (Embedding of "love" transposed) * Wq
      = [0.4, 0.8, 0.6] * Wq = [0.59, 0.78]

Key and Value Vectors

Similarly, we calculate the weighted key and value vectors for all words ("I", "love", "statistics"). Here are the results:

      Weighted Key Vector for "I":         [0.48, 0.54]
      Weighted Key Vector for "love":      [0.68, 0.82]
      Weighted Key Vector for "statistics":[0.77, 1.12]

      Weighted Value Vector for "I":         [0.66, 0.82]
      Weighted Value Vector for "love":      [1.02, 1.10]
      Weighted Value Vector for "statistics":[0.66, 1.02]

Step 2: Calculate Unnormalized Attention Scores

We calculate the unnormalized attention scores by taking the dot product between the weighted query vector for "love" and the weighted key vectors for each word.

Unnormalized Attention Scores:

      Attention("love", "I") = 0.5152
      Attention("love", "love") = 0.7178
      Attention("love", "statistics") = 0.8934

Step 3: Apply Softmax

Now, we apply the softmax function to convert the unnormalized scores into probabilities:

      Attention("love", "I") = 0.3012
      Attention("love", "love") = 0.3675
      Attention("love", "statistics") = 0.3313

Notice that the sum of the attention scores equals 1:

      0.3012 + 0.3675 + 0.3313 = 1

Step 4: Multiply Normalized Attention Scores with Weighted Value Vectors

Finally, we calculate the context vector by multiplying the normalized attention scores with the corresponding weighted value vectors for each word:

Context Vectors:

      Context Vector for "I" = [0.3012 * 0.66, 0.3012 * 0.82] = [0.1988, 0.2460]
      Context Vector for "love" = [0.3675 * 1.02, 0.3675 * 1.10] = [0.3743, 0.4042]  
      Context Vector for "statistics" = [0.3313 * 0.66, 0.3313 * 1.02] = [0.2184, 0.3377]

Understanding the Attention Mechanism

1. What is Attention Mechanism?

2. Query, Key, and Value