Understanding the Attention Mechanism

1. What Is the Attention Mechanism?

The attention mechanism allows models to focus on specific parts of the input sequence when producing an output. It assigns different importance to different parts of the input, enabling the model to "attend" to the most relevant parts of the sequence.

The attention mechanism works by using three key components: Query (Q), Key (K), and Value (V). These components are used to compute how much focus each part of the input should have in the output.

2. Query, Key, and Value

The attention mechanism relies on three vectors, computed for each word from its embedding:

      Query (Q): represents the word we are currently computing attention for; it "asks" how relevant every other word is.
      Key (K): represents each word as something a query can be matched against; the query-key match measures relevance.
      Value (V): carries the content of each word that is blended into the output, weighted by the attention scores.

Each vector is obtained by multiplying the word's embedding by a learned weight matrix (Wq, Wk, or Wv), as the worked example below shows.

3. Attention Score Calculation

The attention score between the query and key vectors is calculated as follows:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V \]

Where:

      Q, K, and V are the query, key, and value matrices described above,
      d_k is the dimension of the key vectors (d_k = 2 in the example below),
      and softmax normalizes the scaled scores so that they sum to 1.
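
As a minimal sketch, the formula translates directly into NumPy. The function name and array shapes here are illustrative choices for this article, not taken from any particular library:

      import numpy as np

      def scaled_dot_product_attention(Q, K, V):
          # scores[i, j]: how strongly query i attends to key j
          d_k = K.shape[-1]
          scores = Q @ K.T / np.sqrt(d_k)
          # softmax over the key axis; subtracting the max improves numerical stability
          e = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights = e / e.sum(axis=-1, keepdims=True)
          # each output row is a weighted average of the value vectors
          return weights @ V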

Key Information

Here’s the key information used for this explanation:

Word Embeddings (1x3):

      Word Embedding for "I": [0.2, 0.5, 0.7]
      Word Embedding for "love": [0.4, 0.8, 0.6]
      Word Embedding for "statistics": [0.1, 0.6, 0.9]
    

Weight Matrices (3x2):

      Query Weight Matrix (Wq):
      [0.1, 0.3]
      [0.2, 0.4]
      [0.5, 0.6]

      Key Weight Matrix (Wk):
      [0.6, 0.2]
      [0.3, 0.5]
      [0.7, 0.9]

      Value Weight Matrix (Wv):
      [0.5, 0.8]
      [0.2, 0.3]
      [0.4, 0.6]
    

Step 1: Compute Weighted Key, Value, and Query Vectors

We compute the weighted query, key, and value vectors by multiplying each word's 1x3 embedding (a row vector) by the respective 3x2 weight matrix, which yields a 1x2 vector. The word "love" serves as the query, so the query projection is computed only for "love"; the weighted key and value vectors are computed for every word.

Query Vector for "love"

      Weighted Query for "love": (1x3 embedding of "love") * Wq
      = [0.4, 0.8, 0.6] * Wq
      = [0.4(0.1) + 0.8(0.2) + 0.6(0.5), 0.4(0.3) + 0.8(0.4) + 0.6(0.6)]
      = [0.50, 0.80]
    

Key and Value Vectors

Similarly, we calculate the weighted key and value vectors for all words ("I", "love", "statistics"). Here are the results:

      Weighted Key Vector for "I":          [0.76, 0.92]
      Weighted Key Vector for "love":       [0.90, 1.02]
      Weighted Key Vector for "statistics": [0.87, 1.13]

      Weighted Value Vector for "I":          [0.48, 0.73]
      Weighted Value Vector for "love":       [0.60, 0.92]
      Weighted Value Vector for "statistics": [0.53, 0.80]
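
These projections are plain matrix products, so they are easy to check with a short NumPy sketch (the variable names are ours, chosen for this example):

      import numpy as np

      # Rows: embeddings for "I", "love", "statistics"
      X = np.array([[0.2, 0.5, 0.7],
                    [0.4, 0.8, 0.6],
                    [0.1, 0.6, 0.9]])

      Wq = np.array([[0.1, 0.3], [0.2, 0.4], [0.5, 0.6]])
      Wk = np.array([[0.6, 0.2], [0.3, 0.5], [0.7, 0.9]])
      Wv = np.array([[0.5, 0.8], [0.2, 0.3], [0.4, 0.6]])

      q_love = X[1] @ Wq   # -> [0.50, 0.80]
      K = X @ Wk           # rows -> [0.76, 0.92], [0.90, 1.02], [0.87, 1.13]
      V = X @ Wv           # rows -> [0.48, 0.73], [0.60, 0.92], [0.53, 0.80]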
    

Step 2: Calculate Unnormalized Attention Scores

We calculate the unnormalized attention scores by taking the dot product between the weighted query vector for "love" and the weighted key vector of each word, then dividing by √d_k = √2, as the formula prescribes.

Unnormalized Attention Scores:

      Dot products:
      Attention("love", "I")          = 0.50(0.76) + 0.80(0.92) = 1.1160
      Attention("love", "love")       = 0.50(0.90) + 0.80(1.02) = 1.2660
      Attention("love", "statistics") = 0.50(0.87) + 0.80(1.13) = 1.3390

      Scaled by √2:
      Attention("love", "I")          = 0.7891
      Attention("love", "love")       = 0.8952
      Attention("love", "statistics") = 0.9468
    

Step 3: Apply Softmax

Now, we apply the softmax function to convert the scaled scores into probabilities:

      Attention("love", "I")          = 0.3046
      Attention("love", "love")       = 0.3387
      Attention("love", "statistics") = 0.3567
    

Notice that softmax preserves the ordering of the scores ("statistics" keeps the largest weight) and that the attention weights sum to 1:

      0.3046 + 0.3387 + 0.3567 = 1
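
A two-line NumPy check of the softmax, as a sketch using the scaled scores from Step 2:

      import numpy as np

      scaled = np.array([0.7891, 0.8952, 0.9468])
      weights = np.exp(scaled) / np.exp(scaled).sum()   # -> [0.3046, 0.3387, 0.3567]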
    

Step 4: Multiply Normalized Attention Scores with Weighted Value Vectors and Sum

Finally, we weight each word's value vector by its normalized attention score and add the results together. The sum is the context vector for the query word "love", a blend of all the value vectors:

Weighted Contributions:

      Contribution of "I":          0.3046 * [0.48, 0.73] = [0.1462, 0.2224]
      Contribution of "love":       0.3387 * [0.60, 0.92] = [0.2032, 0.3116]
      Contribution of "statistics": 0.3567 * [0.53, 0.80] = [0.1891, 0.2854]

      Context Vector for "love" ≈ [0.5385, 0.8193]
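
The final step is a single weighted sum, shown here as a NumPy sketch (the rounded inputs reproduce the result to about four decimal places):

      import numpy as np

      weights = np.array([0.3046, 0.3387, 0.3567])
      V = np.array([[0.48, 0.73],    # "I"
                    [0.60, 0.92],    # "love"
                    [0.53, 0.80]])   # "statistics"

      context = weights @ V          # -> [0.5385, 0.8193]

This context vector is the attention layer's output for the position of "love": unlike the original embedding, it now carries information from every word in the sentence, in proportion to the attention weights.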