In 2017, the AI landscape for language understanding changed dramatically when researchers at Google introduced a groundbreaking architecture called the “Transformer” in their paper “Attention Is All You Need.” Before this, the dominant sequence models (such as recurrent neural networks) processed text one word at a time and often struggled to keep track of context across long sentences. The Transformer’s self-attention mechanism changed language processing by assigning numerical scores to the relationships between words, enabling the model to capture context and connections even when words are far apart in a sentence. This advancement is what makes LLMs capable of generating surprisingly natural and coherent text.

Here’s the paper: https://arxiv.org/abs/1706.03762

Key ideas:

  • Self-Attention: The model checks how important each word is in relation to other words in the sentence.
  • Positional Encoding: Since the model looks at all words at once, it needs a way to understand the order of the words.
  • Parallel Processing: The model processes all the words in a sentence at the same time, which makes training much faster than one-word-at-a-time models.

Breaking Down the Parts

Self-Attention: How Words Relate to Each Other

Think of a sentence like “The dog chased the cat.” The word “chased” is closely related to “dog” and “cat.” The self-attention mechanism helps the model figure out which words are important to each other.

Here’s a simplified version of how we can code this:

import torch
import torch.nn as nn

# Self-Attention Layer
class SimpleSelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SimpleSelfAttention, self).__init__()
        self.embed_size = embed_size

        # Linear layers that turn the input into 'queries', 'keys', and 'values'
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query):
        # Project the inputs into query, key, and value vectors
        queries = self.queries(query)
        keys = self.keys(keys)
        values = self.values(values)

        # Calculate attention scores by comparing queries and keys
        attention = torch.matmul(queries, keys.transpose(-2, -1))

        # Scale the scores and turn them into weights that sum to 1,
        # so the model decides which words are most related to each other
        attention_weights = torch.softmax(attention / (self.embed_size ** 0.5), dim=-1)

        # Apply these attention weights to the values
        out = torch.matmul(attention_weights, values)
        return out

What’s happening here:

  • We create three linear layers that turn each word’s vector into ‘queries’, ‘keys’, and ‘values’.
  • The model compares queries with keys to work out how much each word should pay attention to every other word, and gives more weight to the important ones. A quick usage check follows below.
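
To see the layer in action, here is a minimal sanity check. It assumes the SimpleSelfAttention class above has been defined and torch imported; the batch size, sentence length, and embedding size are arbitrary values chosen only for illustration.

# Sanity check: run a fake "sentence" of 5 word vectors through the layer.
# The sizes (batch of 1, 5 words, 16-dimensional embeddings) are arbitrary.
embed_size = 16
attention_layer = SimpleSelfAttention(embed_size)

x = torch.randn(1, 5, embed_size)  # shape: (batch, seq_len, embed_size)

# For self-attention, the same tensor plays the value, key, and query roles
out = attention_layer(x, x, x)
print(out.shape)  # torch.Size([1, 5, 16]) -- one updated vector per word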

Feedforward Layer: Processing Each Word Individually

Once the model understands the relationships between words, it processes each word further using a small neural network (called a feedforward network) that is applied to every position independently. Think of this as fine-tuning each word’s meaning.

# Feedforward Layer
class SimpleFeedForward(nn.Module):
    def __init__(self, embed_size):
        super(SimpleFeedForward, self).__init__()
        # Two linear layers that process the data
        self.fc1 = nn.Linear(embed_size, embed_size * 2)
        self.fc2 = nn.Linear(embed_size * 2, embed_size)

    def forward(self, x):
        # Apply ReLU between the layers to add non-linearity
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

What’s happening here:

  • We pass each word’s vector through two linear layers, expanding it and then shrinking it back to its original size.
  • The ReLU function adds non-linearity, which lets the model capture more complex patterns. A small shape check follows below.
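
Here is a small shape check, assuming the SimpleFeedForward class above is defined; the sizes are again arbitrary.

# The feedforward layer expands each word vector and shrinks it back,
# so the output shape matches the input shape.
ff = SimpleFeedForward(embed_size=16)
x = torch.randn(1, 5, 16)  # 5 word vectors of size 16
print(ff(x).shape)  # torch.Size([1, 5, 16])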

Positional Encoding: Keeping Track of Word Order

Since the Transformer looks at all words at once, it needs a way to remember the order of the words. That’s what positional encoding is for—it helps the model know the position of each word in the sentence.

import math

# Positional Encoding
class SimplePositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len):
        super(SimplePositionalEncoding, self).__init__()

        # Create a matrix to store a positional encoding for each position in the sentence
        encoding = torch.zeros(max_len, embed_size)

        # Use sine and cosine functions at different frequencies to create unique encodings
        # (assumes embed_size is even)
        for pos in range(max_len):
            for i in range(0, embed_size, 2):
                encoding[pos, i] = math.sin(pos / (10000 ** (i / embed_size)))
                encoding[pos, i + 1] = math.cos(pos / (10000 ** (i / embed_size)))

        # Register as a buffer so it is stored with the model but not trained
        self.register_buffer('encoding', encoding)

    def forward(self, x):
        # Add positional encoding to the input to give it information about word order
        seq_len = x.size(1)
        return x + self.encoding[:seq_len, :]

What’s happening here:

  • We use sine and cosine functions to create a distinct pattern for each position in the sentence.
  • These positional encodings are added to each word’s embedding, so the model knows where every word sits. A small check follows below.
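
As a small check (assuming the SimplePositionalEncoding class above is defined), you can look at the encodings directly and confirm that adding them does not change the tensor shape; the sizes below are arbitrary.

# Each position gets its own sine/cosine pattern.
pos_enc = SimplePositionalEncoding(embed_size=16, max_len=50)
print(pos_enc.encoding[0, :4])  # pattern for the 1st position
print(pos_enc.encoding[1, :4])  # pattern for the 2nd position

# Adding positional information keeps the shape unchanged
x = torch.randn(1, 10, 16)
print(pos_enc(x).shape)  # torch.Size([1, 10, 16])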

Putting It All Together

Now we can combine these pieces to create one Transformer Block. Each block contains:

  1. Self-attention: This looks at how words relate to each other.
  2. Feedforward: This processes each word further.
  3. Layer normalization and residual connections: These keep training stable and preserve the original input.

(Positional encoding is added to the word embeddings once, before the stack of blocks, rather than inside each block.)

# Transformer Block
class SimpleTransformerBlock(nn.Module):
    def __init__(self, embed_size):
        super(SimpleTransformerBlock, self).__init__()
        # Self-attention layer
        self.attention = SimpleSelfAttention(embed_size)
        # Feedforward layer
        self.feed_forward = SimpleFeedForward(embed_size)
        # Layer normalization to stabilize training
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

    def forward(self, value, key, query):
        # Apply self-attention and add it to the original input (residual connection)
        attention = self.attention(value, key, query)
        x = self.norm1(attention + query)

        # Apply the feedforward layer and add it to its input (another residual connection)
        forward = self.feed_forward(x)
        out = self.norm2(forward + x)
        return out

What’s happening here:

  • The model uses self-attention to figure out which words are important to each other.
  • Then, it processes the words further using the feedforward network.
  • The model keeps track of the original input (residual connection), which helps prevent it from losing important information. A quick shape check is shown below.
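
To confirm that one block keeps the shape of its input, here is a quick check assuming the classes above are defined; the sizes are arbitrary.

# One Transformer block applied to a batch of 5 word vectors.
block = SimpleTransformerBlock(embed_size=16)
x = torch.randn(1, 5, 16)

# The same tensor is passed as value, key, and query (self-attention)
out = block(x, x, x)
print(out.shape)  # torch.Size([1, 5, 16])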

Building the Full Transformer Model

Finally, we can stack multiple Transformer blocks together to build the full Transformer model. Each layer helps the model refine its understanding of the input sentence.

# Full Transformer Model
class SimpleTransformer(nn.Module):
    def __init__(self, embed_size, num_layers, vocab_size, max_len):
        super(SimpleTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)  # Converts word indices to dense vectors
        self.pos_encoding = SimplePositionalEncoding(embed_size, max_len)  # Adds positional information
        self.layers = nn.ModuleList([SimpleTransformerBlock(embed_size) for _ in range(num_layers)])  # Stack of blocks
        self.fc_out = nn.Linear(embed_size, vocab_size)  # Final layer to score each vocabulary word

    def forward(self, x):
        # Convert word indices into embeddings (dense vectors)
        x = self.embedding(x)
        # Add positional encodings
        x = self.pos_encoding(x)

        # Pass through each Transformer block
        for layer in self.layers:
            x = layer(x, x, x)

        # Produce a score for every word in the vocabulary, at every position
        out = self.fc_out(x)
        return out

What’s happening here:

  • The embedding layer converts the input words (which are just numbers) into dense vectors.
  • Positional encoding is added to give the model information about word order.
  • The input passes through several layers of Transformer blocks to get refined.
  • The final layer predicts the next word in the sequence.

Explanation of Output:

  • Input Sentence (Word Indices): The model takes an input sentence where each word is represented by a number (a word index). For example:
    Input sentence (as word indices):
    tensor([[84, 94, 65, 95, 13, 22, 91, 63, 33, 25]])

    In this case, the sentence has 10 words, each represented by a random number between 0 and 99 (because we defined a vocabulary size of 100). These numbers represent words in the vocabulary.
  • Model Output: For each word in the input sentence, the model produces one row of scores, with one score for every possible word in the vocabulary. These are raw scores (logits), and they can later be used to predict the next word or to classify the sentence. Example:
    Model output (raw scores for each word in the vocabulary):
    tensor([[[ 0.3396, -0.0083,  0.0446,  ...,  0.1134,  0.0808, -0.1356],
             [ 0.1091, -0.0892,  0.1334,  ...,  0.1204, -0.0525, -0.1042],
             [ 0.2682,  0.0851,  0.0139,  ...,  0.1157, -0.0477, -0.0428],
             ...,
             [ 0.1182,  0.0201,  0.0884,  ...,  0.1215, -0.0724, -0.0982],
             [ 0.1078, -0.0322,  0.1099,  ...,  0.1612,  0.0243, -0.0226],
             [ 0.1512,  0.0454,  0.0061,  ...,  0.0716,  0.0277, -0.1270]]],
           grad_fn=<ViewBackward0>)

What’s happening here:
  • Input Sentence: The model is given a random sentence (just numbers representing words).
  • Embeddings: The sentence is converted into dense vectors (embeddings) that the model can work with.
  • Positional Encoding: Positional information is added so the model knows where each word is in the sentence.
  • Transformer Blocks: The input goes through several layers where the model decides which words are important to each other.
  • Output: For each position, the model produces a score for every word in the vocabulary; these scores can be turned into probabilities to predict the next word.

You can change the parameters (like vocab_size, embed_size, etc.) to see how the output changes, or you can modify the input sentence to see how the model handles different inputs.
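
For reference, here is a small driver along the lines of what produced the output above. Only vocab_size = 100 and the 10-word input are fixed by the example; the other values (embed_size, num_layers, max_len) are illustrative choices.

# Build a tiny model and run a random sentence through it.
vocab_size = 100   # matches the example above
embed_size = 32    # illustrative choice
num_layers = 2     # illustrative choice
max_len = 20       # must be at least as long as the input sentence

model = SimpleTransformer(embed_size, num_layers, vocab_size, max_len)

# A random "sentence": 10 word indices between 0 and 99, shape (batch, seq_len)
sentence = torch.randint(0, vocab_size, (1, 10))
print("Input sentence (as word indices):")
print(sentence)

# Raw scores for every vocabulary word at every position: shape (1, 10, 100)
output = model(sentence)
print("Model output (raw scores for each word in the vocabulary):")
print(output)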

Conclusion

This guide walks through the main ideas of the Transformer model:

  1. Self-Attention helps the model understand how words relate to each other.
  2. Feedforward Networks process each word further.
  3. Positional Encoding keeps track of the word order.
  4. Transformer Blocks combine these components, and stacking these blocks creates the full model.

Full file

In the next post, we’ll dig deeper.

