In the previous two posts, we built the basic and scalable versions of the Transformer model. Now it’s time to move on to the next critical step: training the model. In this post, we’ll focus on:
- Preparing the dataset.
- Defining the training loop.
- Using loss functions and optimizers.
- Monitoring performance.
- Understanding the model output.
By the end of this post, you’ll have a Transformer model that can be trained on real data, ready to make predictions.
Preparing the Dataset
The first step in training any machine learning model is to prepare a dataset. In the case of Transformers, the dataset usually consists of sequences of text.
Sample Dataset: Text Tokenization
We’ll use a toy dataset of sentences to demonstrate. In practice, you might use larger datasets like Wikitext, OpenWebText, or other public datasets.
Here’s how we can tokenize a sample dataset using torchtext or the transformers library:
from transformers import BertTokenizer
import torch
# Initialize the tokenizer (using BERT's tokenizer for example)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Sample dataset (toy example)
sentences = ["The quick brown fox jumps over the lazy dog.",
"The Transformers architecture is very powerful."]
# Tokenize sentences
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
# Extract input ids and attention masks for the Transformer
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
# Print tokenized input
print(input_ids)
What’s happening here:
- We use BERT’s tokenizer to convert text into sequences of integers (token IDs).
- Attention masks tell the model which parts of the input are actual tokens and which are padding (the short sketch below shows how to inspect both).
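To see what the tokenizer actually produced, you can map the IDs back to tokens and print the attention mask alongside them. This is a small sketch that simply reuses the tokenizer and inputs defined above.
# Map the token IDs of the first sentence back to readable tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(tokens)  # Something along the lines of ['[CLS]', 'the', 'quick', 'brown', 'fox', ...]
# The attention mask is 1 for real tokens and 0 for padding positions
print(attention_mask)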
Defining the Training Loop
Once we have the dataset ready, we need to define the training loop. The loop is responsible for:
- Passing inputs through the Transformer model.
- Calculating loss between predicted and actual outputs.
- Backpropagating the error and updating model parameters.
Here’s a basic training loop:
import torch
import torch.optim as optim
# Assuming the Transformer model from Post 2
model = Transformer(embed_size=256, heads=8, depth=4, forward_expansion=4, max_len=50, dropout=0.1, vocab_size=30522)
# Define the optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.CrossEntropyLoss()
# Dummy target (for this toy example we reuse the input IDs so the shapes match the model output)
target_ids = input_ids.clone()
# Training loop
def train(model, input_ids, target_ids, attention_mask, epochs=10):
    model.train()  # Set the model to training mode
    for epoch in range(epochs):
        optimizer.zero_grad()  # Clear previous gradients
        outputs = model(input_ids)  # Forward pass (this simple model does not use attention_mask)
        # Reshape outputs and targets to match what CrossEntropyLoss expects
        logits = outputs.view(-1, outputs.size(-1))  # (batch * seq_len, vocab_size)
        targets = target_ids.view(-1)                # (batch * seq_len,)
        # Calculate loss and backpropagate
        loss = loss_fn(logits, targets)
        loss.backward()   # Backpropagate the error
        optimizer.step()  # Update the model weights
        print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
# Example of training
train(model, input_ids, target_ids, attention_mask)
Loss Function and Optimizer
Loss Function: Cross-Entropy Loss
- Since we are dealing with text, the cross-entropy loss function is appropriate: it compares the model’s predicted distribution over the vocabulary (the logits, one per position) against the actual token IDs in the target (a small shape sketch follows below).
Optimizer: Adam
- We use the Adam optimizer for training. It’s widely used for neural networks because it adapts the learning rate for each parameter.
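To make the shapes concrete, here’s a minimal standalone sketch (with made-up sizes) of what CrossEntropyLoss expects, which is why the training loop flattens the model output before computing the loss:
import torch

vocab_size = 30522
batch_size, seq_len = 2, 10
# Fake logits, shaped like a Transformer's output: (batch, seq_len, vocab_size)
logits = torch.randn(batch_size, seq_len, vocab_size)
# Fake targets: one token ID per position, shape (batch, seq_len)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))
ce = torch.nn.CrossEntropyLoss()
# CrossEntropyLoss wants (N, num_classes) logits and (N,) class indices
loss = ce(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())  # Roughly ln(vocab_size) for untrained, random logits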
Monitoring Performance
During training, it’s essential to monitor the loss to ensure that the model is learning effectively. If the loss is not decreasing, the model might need adjustments, such as:
- Learning rate tweaks.
- More epochs.
- Regularization techniques like dropout or weight decay (see the sketch after this list).
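As a rough illustration of those adjustments (the values are placeholders, not recommendations), here is one way to wire them up in PyTorch: a smaller learning rate, weight decay on Adam, and a step scheduler that decays the learning rate every few epochs.
import torch.optim as optim

# Lower learning rate plus weight decay (L2-style regularization)
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
# Halve the learning rate every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

# Inside the training loop, call scheduler.step() once per epoch, after optimizer.step()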
To visualize training progress, we can plot the loss over time using matplotlib:
import matplotlib.pyplot as plt
def train_with_monitoring(model, input_ids, target_ids, attention_mask, epochs=10):
    model.train()
    loss_values = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(input_ids)
        logits = outputs.view(-1, outputs.size(-1))
        targets = target_ids.view(-1)
        loss = loss_fn(logits, targets)
        loss.backward()
        optimizer.step()
        loss_values.append(loss.item())
        print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
    # Plot loss over epochs
    plt.plot(range(1, epochs + 1), loss_values, label="Training Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
# Train and monitor
train_with_monitoring(model, input_ids, target_ids, attention_mask)
The per-epoch printout gives you a real-time view of training, and the plot at the end shows how the loss evolved across epochs.
Explanation of the Output
Here’s what happens when you train the Transformer model:
- Input Sentence: The input sentences are tokenized into numbers (IDs) that index words in the vocabulary. Example input (the exact IDs depend on the tokenizer):
tensor([[  101,  1996,  4248,  2829,  4419,  2058,  1996, 13971,  3899,   102],
        [  101,  1996, 17288, 19085,  2324,  2003,  2200,  3787,   102,     0]])
- Target: The target sequence is also tokenized in a similar way. This is used to calculate the loss.
- Output: During each epoch, the model outputs a set of scores (logits) over the vocabulary for each position, and the loss function compares these predictions to the actual tokens (the target). The loss value is printed at each epoch (see the decoding sketch after this list). Example output:
Epoch 1, Loss: 5.673
Epoch 2, Loss: 4.532
...
- Loss Values: As training progresses, the loss decreases, which means the model is improving at predicting the next word in the sequence.
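To look past the raw loss number, you can also turn the logits into predicted token IDs and decode them back to text. This sketch assumes the model returns logits shaped (batch, seq_len, vocab_size), as in the training loop above; after only a few epochs on a toy dataset, expect the decoded text to be rough.
model.eval()
with torch.no_grad():
    logits = model(input_ids)      # Assumed shape: (batch, seq_len, vocab_size)
pred_ids = logits.argmax(dim=-1)   # Most likely token ID at each position
# Decode the first sequence back to a readable string
print(tokenizer.decode(pred_ids[0].tolist()))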
Visual Output:
After running the train_with_monitoring function, you’ll see a loss curve:
- The x-axis represents the number of epochs.
- The y-axis represents the loss value.
- The curve will ideally go down as the model gets better at predicting the correct words.
Evaluation (Optional)
After training, you may want to evaluate the model on a validation dataset to check its performance before using it for actual predictions.
Example of Model Evaluation:
def evaluate(model, input_ids, target_ids, attention_mask):
    model.eval()  # Set the model to evaluation mode (disables dropout)
    with torch.no_grad():  # Disable gradient calculation for efficiency
        outputs = model(input_ids)
        logits = outputs.view(-1, outputs.size(-1))
        targets = target_ids.view(-1)
        loss = loss_fn(logits, targets)
        print(f"Evaluation Loss: {loss.item()}")

# Evaluate the model (here we reuse the training batch for illustration)
evaluate(model, input_ids, target_ids, attention_mask)
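In practice, the validation batch should come from sentences the model has not trained on. Here’s a hedged sketch with made-up validation sentences, reusing the input IDs as dummy targets to mirror the toy setup above:
# Hypothetical held-out sentences (placeholders, not a real validation set)
val_sentences = ["A completely different sentence for validation.",
                 "Transformers can be evaluated on held-out text."]
val_inputs = tokenizer(val_sentences, return_tensors="pt", padding=True, truncation=True)

# Dummy targets: reuse the validation input IDs, as in the training example
val_target_ids = val_inputs['input_ids'].clone()

evaluate(model, val_inputs['input_ids'], val_target_ids, val_inputs['attention_mask'])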
Conclusion
In this post, we walked through the process of training a Transformer model. Here’s a quick recap:
- We prepared a dataset and tokenized it for the Transformer.
- We defined a training loop with a loss function and optimizer.
- We monitored the model’s performance during training using loss values.
- Finally, we explained the model’s output during training and how to visualize the loss.
With this foundation, you can now train your own Transformer models on any dataset. In the next post, we’ll dive deeper into evaluating model performance and fine-tuning.