Fine-Tuning TinyLLaMA with Unsloth: A Hands-On Guide

Hey there, folks! Tommy here, ready to dive into the exciting world of fine-tuning TinyLLaMA, a Small Language Model (SLM) optimized for edge devices like mobile phones. Whether you're an intermediate developer, AI enthusiast, or gearing up for your next hackathon project, this tutorial will walk you through everything you need to know to fine-tune TinyLLaMA using Unsloth.

Now let's get started!

Prerequisites

Before we jump into the tutorial, make sure you have the following:

Basic Python Knowledge
Familiarity with Machine Learning Concepts
A Google account for accessing Google Colab.
A Weights & Biases (W&B) account (you can sign up at wandb.ai).

Setting Up Fine-Tuning Environment

We’ll use Google Colab to fine-tune TinyLLaMA, which offers a free and accessible GPU for this process. Here’s how to get started:

Create a New Colab Notebook:

First, head over to Google Colab and create a new notebook.
Next, ensure you have a GPU available by setting the notebook's runtime to use a GPU. You can do this by going to the menu and selecting Runtime > Change runtime type. In the window that appears, choose T4 GPU from the Hardware accelerator section.
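To confirm the runtime really has a GPU attached, a quick check along these lines can help. This is a minimal sketch that shells out to nvidia-smi, which GPU runtimes normally provide; the helper name gpu_summary is made up for illustration.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return a one-line GPU description, or a notice if no GPU is visible."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return "No GPU detected: nvidia-smi not found"
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return "No GPU detected: nvidia-smi failed"
    return result.stdout.strip()


print(gpu_summary())
```

On a T4 runtime this should report the GPU name and its total memory; if it reports no GPU, revisit the runtime type setting.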

Install Dependencies:

Now we need to install the required libraries and dependencies. Run the command below in your code cell:

!pip install -q unsloth transformers datasets trl wandb

Loading the Model and Tokenizer

After setting up your environment, the next step is to load the TinyLLaMA model and its tokenizer.

How to Load the TinyLLaMA Model

Here’s how to load the TinyLLaMA model with some configuration options:

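from unsloth import FastLanguageModel

# Model id assumed from Unsloth's published TinyLlama checkpoints; swap in your own if needed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/tinyllama-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization keeps the model within the free T4's memory
)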

Layer Selection and Hyperparameters

After loading the model, the next step involves configuring it for fine-tuning by selecting specific layers and setting key hyperparameters. We'll be using the get_peft_model method from the FastLanguageModel provided by Unsloth. This method allows us to apply Parameter-Efficient Fine-Tuning (PEFT) techniques, specifically Low-Rank Adaptation (LoRA), which helps in adapting the model with fewer parameters while maintaining performance.
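To make "fewer parameters" concrete: LoRA freezes a weight matrix with d_in inputs and d_out outputs and trains two small matrices of rank r next to it, which adds r * (d_in + d_out) trainable parameters per adapted layer. The sketch below counts them. The dimensions are my reading of TinyLlama-1.1B's published config and the rank of 16 is an illustrative choice, so treat both as assumptions and check them against your model.

```python
# Assumed TinyLlama-1.1B dimensions (check the model's config.json).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 5632  # intermediate_size of the feed-forward block
KV_DIM = 4 * 64      # num_key_value_heads * head_dim (grouped-query attention)
NUM_LAYERS = 22

# (input_features, output_features) of each projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int, target_modules: list[str]) -> int:
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per module."""
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in target_modules)
    return per_layer * NUM_LAYERS


total = lora_param_count(rank=16, target_modules=list(PROJECTIONS))
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the count comes to 12,615,680, roughly 1% of the model's 1.1 billion weights, which is why LoRA fits comfortably on a small GPU.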

Configuring the Model

Here’s the code to configure the model:

from unsloth import FastLanguageModel

# Attach LoRA adapters; the remaining arguments are elided here
model = FastLanguageModel.get_peft_model(model, ...)

When fine-tuning TinyLLaMA, special attention should be given to the attention and feed-forward layers.

Attention Layers:

These layers are key to how TinyLLaMA focuses on different parts of the input. By fine-tuning these layers, you help the model better understand and contextualize the data. Examples of the layers used are "q_proj", "k_proj", "v_proj", "o_proj".

Feed-Forward Layers:

These layers handle the transformations post-attention, crucial for the model’s ability to process and generate complex outputs. Optimizing these layers can greatly enhance performance on specific tasks. Examples of the layers used are "gate_proj", "up_proj", "down_proj".
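Putting the two groups of layers together, a LoRA configuration might look like the sketch below. The module names come from the text above; the rank, alpha, and dropout values are illustrative assumptions rather than the settings used in this tutorial, and apply_lora is a hypothetical helper name.

```python
# Module names come from the text above; every other value is an assumed,
# illustrative setting rather than the tutorial's actual configuration.
LORA_CONFIG = {
    "r": 16,                 # LoRA rank (assumed)
    "lora_alpha": 16,        # scaling factor (assumed)
    "lora_dropout": 0.0,     # no adapter dropout (assumed)
    "bias": "none",          # leave bias terms frozen
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention layers
        "gate_proj", "up_proj", "down_proj",     # feed-forward layers
    ],
}


def apply_lora(model):
    """Wrap an Unsloth-loaded model with LoRA adapters."""
    # Imported lazily so the config can be inspected without a GPU.
    from unsloth import FastLanguageModel

    return FastLanguageModel.get_peft_model(model, **LORA_CONFIG)
```

Calling model = apply_lora(model) after loading would attach the adapters. Dropping the feed-forward names from target_modules is a simple way to trade some quality for a smaller memory footprint.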

Preparing the Dataset and Defining the Prompt Format

After configuring your model, the next step is to prepare your dataset and define the prompt format. For this tutorial, we'll use the Alpaca dataset from Hugging Face, but I'll also show you how to create and load a custom dataset if you want to use your own data.

Using the Alpaca Dataset

The Alpaca dataset is designed for training models to follow instructions. We’ll load it directly from Hugging Face and format it according to the structure expected by the TinyLLaMA model.

from datasets import load_dataset

dataset = load_dataset('tatsu-lab/alpaca', split='train')  # the dataset ships a single "train" split
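Formatting means merging each example's instruction, input, and output fields into one training string. Here is a minimal sketch of that step. The template wording is the commonly used Alpaca prompt and is an assumption on my part, as are the function name format_prompts and the default end-of-sequence token.

```python
# The widely used Alpaca template; the exact wording is an assumption here.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_prompts(batch: dict, eos_token: str = "</s>") -> dict:
    """Turn a batch of Alpaca columns into a single "text" column."""
    texts = [
        # The EOS token teaches the model where a response ends.
        ALPACA_PROMPT.format(instruction=ins, input=inp, output=out) + eos_token
        for ins, inp, out in zip(batch["instruction"], batch["input"], batch["output"])
    ]
    return {"text": texts}


example = {
    "instruction": ["What is AI?"],
    "input": [""],
    "output": ["AI is the simulation of human intelligence in machines."],
}
print(format_prompts(example)["text"][0])
```

With the datasets library you would typically apply it as dataset.map(format_prompts, batched=True, fn_kwargs={"eos_token": tokenizer.eos_token}), so the token matches your tokenizer rather than the hard-coded default.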

Creating and Loading a Custom Dataset

If you want to use your own custom dataset, you can easily do so by following these steps:

Create a JSON file with your data. The file should contain a list of objects, each with instruction, input, and output fields. For example:

[{"instruction": "What is AI?", "input": "", "output": "AI is the simulation of human intelligence in machines."}...]

Save this JSON file, for example as dataset.json.

Load the custom dataset using the load_dataset function from the Hugging Face datasets library:

custom_dataset = load_dataset('json', data_files='dataset.json')
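If you are generating the file from Python rather than writing it by hand, something like the sketch below can catch missing fields before training starts. The helper name write_dataset is hypothetical, and the single record is just the example from above.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("instruction", "input", "output")


def write_dataset(records: list[dict], path: Path) -> Path:
    """Validate records and write them as a JSON list."""
    for i, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


records = [
    {
        "instruction": "What is AI?",
        "input": "",
        "output": "AI is the simulation of human intelligence in machines.",
    },
]
# Written to a temporary directory here; use "dataset.json" in your notebook.
dataset_path = write_dataset(records, Path(tempfile.mkdtemp()) / "dataset.json")
print(dataset_path)
```

The resulting file can then be passed to load_dataset('json', data_files=...) exactly as shown above.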

Monitoring Fine-Tuning with W&B

Weights & Biases (W&B) is an essential tool for tracking your model’s training process and system resource usage. It helps visualize metrics in real time, providing valuable insights into both model performance and GPU utilization.

We’ll use W&B to monitor our training process, including evaluation metrics and resource usage. Sign up for W&B and copy your API key from your account settings; wandb asks for it the first time you start a run in the notebook. This setup lets you track all the important metrics in real time.
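As an alternative to passing settings in code, the Hugging Face W&B integration reads a few environment variables. A minimal sketch, where configure_wandb is a hypothetical helper and the project name mirrors the one used later in this tutorial:

```python
import os


def configure_wandb(project: str, log_model: str = "false") -> dict:
    """Set the environment variables read by the W&B integration."""
    settings = {
        "WANDB_PROJECT": project,      # project the run is logged under
        "WANDB_LOG_MODEL": log_model,  # "false" skips uploading checkpoints
    }
    os.environ.update(settings)
    return settings


configure_wandb("tiny-llama")
```

Run this before creating the trainer so the values are picked up. The API key itself is better entered through the login prompt than hard-coded in a notebook you might share.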

Training TinyLLaMA with W&B Integration

Now that everything is set up, it’s time to train the TinyLLaMA model. We’ll be using the SFTTrainer from the trl library, along with Weights & Biases (W&B) for real-time tracking of training metrics and resource usage. This step ensures you can monitor your training effectively and make necessary adjustments on the fly.

Initializing W&B and Setting Training Arguments

import wandb

wandb.init(project='tiny-llama')

training_args = {...}

Next, we set up the SFTTrainer:

from trl import SFTTrainer
sft_trainer = SFTTrainer(...)

Efficient Resource Management

Batch Size and Gradient Accumulation: Due to the limited memory of the GPU, especially in a free Colab environment, keep the batch size small. Use gradient accumulation to simulate a larger batch size, which stabilizes training without running out of memory.
Mixed Precision Training: Use mixed precision to reduce memory usage and speed up training. The Tesla T4 supports FP16 but not BF16, so reserve BF16 for Ampere or newer GPUs.
4-bit Quantization and 8-bit Optimizers: Loading the model with 4-bit quantization (load_in_4bit=True) and using an 8-bit optimizer significantly reduces the memory footprint, which makes training feasible on small GPUs.
Logging and Monitoring: W&B provides real-time monitoring of training metrics such as loss and learning rate, as well as resource usage (CPU/GPU). Use this to keep an eye on the training dynamics and adjust hyperparameters if needed.
Evaluation Strategy: Set the evaluation strategy to "steps" so that the model is evaluated periodically during training. This requires passing a held-out evaluation set to the trainer, and it lets you monitor progress and catch overfitting early.
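The points above map onto a handful of training arguments. The sketch below uses field names from transformers.TrainingArguments, but every value is an illustrative assumption except the 100-step budget mentioned in the conclusion. It also shows how batch size and gradient accumulation combine into an effective batch size.

```python
# Illustrative values only; names follow transformers.TrainingArguments.
TRAINING_ARGS = {
    "output_dir": "outputs",
    "per_device_train_batch_size": 2,  # small batch to fit in T4 memory
    "gradient_accumulation_steps": 4,  # accumulate gradients over 4 batches
    "max_steps": 100,                  # step budget mentioned in the conclusion
    "learning_rate": 2e-4,
    "fp16": True,                      # T4 supports FP16 but not BF16
    "optim": "adamw_8bit",             # 8-bit AdamW (needs bitsandbytes)
    "eval_strategy": "steps",          # "evaluation_strategy" in older releases
    "eval_steps": 20,
    "logging_steps": 1,
    "report_to": "wandb",
}


def effective_batch_size(args: dict, num_gpus: int = 1) -> int:
    """Examples that contribute to each optimizer step."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_gpus
    )


examples_seen = effective_batch_size(TRAINING_ARGS) * TRAINING_ARGS["max_steps"]
print(effective_batch_size(TRAINING_ARGS), examples_seen)
```

With these numbers each optimizer step sees 8 examples, so 100 steps cover 800 examples. Alpaca has about 52,000 examples, which is why a run like this falls well short of a full epoch. The dictionary can be unpacked into the arguments object your trl version expects; check its documentation, since parameter names have shifted between releases.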

With everything in place, start the training loop by calling sft_trainer.train(). When training completes, you can use the model as needed; remember to close your Weights & Biases (W&B) session with wandb.finish().
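One way to make sure the W&B session is closed even when training crashes (say, from an out-of-memory error) is to wrap the call in try/finally. This is a small sketch; train_and_close is a hypothetical helper, not part of trl or wandb.

```python
def train_and_close(trainer, finish):
    """Run trainer.train() and always call finish(), even if training fails."""
    try:
        return trainer.train()
    finally:
        finish()  # e.g. wandb.finish, so the run is marked complete
```

In the notebook this would be called as stats = train_and_close(sft_trainer, wandb.finish).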

Monitoring Training with Weights & Biases (W&B)

After integrating W&B into your training script, you can easily monitor various metrics from the W&B dashboard.

To View and Interpret These Metrics:

Log in to W&B: Once your training starts, log in to the W&B website.
Navigate to Your Project: In your workspace, locate the project you've set up (e.g., "tiny-llama") and click on it.
Explore the Dashboard: In the eval, train, and system sections, you'll find a variety of metrics visualized over time.

Interpretation of Metrics

Evaluation Metrics:

Loss: Monitoring this helps identify overfitting - if this loss starts increasing while training loss decreases, it suggests overfitting.
Steps per Second: Measures how fast evaluation runs, which helps you gauge the overhead each evaluation pass adds.

Training Metrics:

Loss: Indicates how well the model learns from the training data. A decreasing loss typically indicates that the model is learning, but a very low loss might suggest overfitting.
Learning Rate: Shows the scheduler's learning rate over time. A warmup followed by a gradual decay helps the model converge without overshooting.
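The overfitting signal described above (evaluation loss rising while training loss keeps falling) is easy to check programmatically once you have both curves. A minimal sketch with made-up loss values; overfitting_steps is a hypothetical helper.

```python
def overfitting_steps(train_loss: list[float], eval_loss: list[float]) -> list[int]:
    """Indices where eval loss rose while train loss fell since the previous point."""
    flagged = []
    for i in range(1, min(len(train_loss), len(eval_loss))):
        train_falling = train_loss[i] < train_loss[i - 1]
        eval_rising = eval_loss[i] > eval_loss[i - 1]
        if train_falling and eval_rising:
            flagged.append(i)
    return flagged


# Made-up loss values, sampled at the same evaluation steps.
train = [1.90, 1.45, 1.20, 1.02, 0.88]
evals = [1.95, 1.60, 1.48, 1.53, 1.61]
print(overfitting_steps(train, evals))
```

Here the check flags indices 3 and 4, where the two curves start to diverge. Real curves are noisy, so treat a single flagged point as a hint and a sustained run of them as the actual warning.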

System Resource Usage:

GPU Power and Memory Usage: Visualize how your model utilizes the GPU. High and stable usage suggests efficient training.

Testing the Fine-Tuned Model

After fine-tuning your model, you can test its performance with the following code:

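inputs = tokenizer("What is AI?", return_tensors="pt").to("cuda")  # tokenize the prompt and move it to the GPU
output = tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True)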
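A causal language model echoes the prompt before its answer, so the decoded text usually needs trimming. The sketch below assumes an Alpaca-style prompt with a "### Response:" marker; extract_response is a hypothetical helper, and the sample string is hand-written rather than real model output.

```python
RESPONSE_MARKER = "### Response:"  # assumes the Alpaca-style prompt format


def extract_response(decoded: str) -> str:
    """Return only the text after the last response marker."""
    # Causal LMs echo the prompt, so everything before the marker is input.
    _, found, answer = decoded.rpartition(RESPONSE_MARKER)
    return answer.strip() if found else decoded.strip()


decoded = (
    "### Instruction:\nWhat is AI?\n\n"
    "### Response:\nAI is the simulation of human intelligence in machines."
)
print(extract_response(decoded))
```

If the marker is missing, the function falls back to returning the whole string, so it is safe to use on the base model's output as well.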

Before and After Fine-Tuning

Before Fine-Tuning: The model doesn't give any response at all.
After Fine-Tuning: The model should provide accurate and contextually appropriate answers.

Saving the Fine-Tuned Model

The next step is to save the model. You can save the model locally or push it to the Hugging Face Hub for easy sharing and future use. Here's how you can do both:

Saving the Model Locally

model.save_pretrained('path/to/save')
tokenizer.save_pretrained('path/to/save')

Saving the Model to the Hugging Face Hub

To share your model with the community or to easily access it later, you can push it to the Hugging Face Hub. First, log in to your Hugging Face account:

!huggingface-cli login

Then, use the following commands to push the model and tokenizer:

model.push_to_hub('my-model')
tokenizer.push_to_hub('my-model')

Free Up GPU Space

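Now that your model is saved, you can run the following commands to free up space on the GPU:

import gc
import torch

del model
del tokenizer
gc.collect()  # drop the Python references first
torch.cuda.empty_cache()  # then release PyTorch's cached GPU memory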

Practical Tips

Avoid Overfitting:

Overfitting happens when your model learns the training data too well, capturing noise and irrelevant patterns that don't generalize to new data. To prevent this, monitor the training process closely.

Early Stopping: Stop training when the validation performance stops improving. This prevents the model from continuing to fit the training data unnecessarily.
Regularization: Add techniques like dropout, which randomly drops neurons during training (with LoRA, this is the lora_dropout setting), or L2 regularization, which penalizes large weights in the model (weight decay in the training arguments), to prevent it from becoming too complex.
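Early stopping boils down to a patience counter over the evaluation loss. Here is a small library-free sketch of that logic; the class name and the loss values are invented for illustration.

```python
class EarlyStopper:
    """Signal a stop after `patience` evaluations without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss   # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1     # no meaningful improvement
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=2)
for step, loss in enumerate([1.60, 1.48, 1.53, 1.61, 1.70]):
    if stopper.should_stop(loss):
        print(f"stop at evaluation {step}, best loss {stopper.best}")
        break
```

With a patience of 2, the loop stops at evaluation 3, two evaluations after the best loss of 1.48. In a real run you would more likely reach for the EarlyStoppingCallback that ships with transformers, which applies the same idea inside the trainer.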

Handle Imbalanced Data:

Imbalanced datasets occur when some classes have significantly more examples than others, which can lead the model to become biased toward the majority class.

Oversampling: Increase the number of examples of the minority class by duplicating them or generating synthetic data. This balances the dataset and gives the model more examples to learn from.
Class Weighting: Adjust the loss function to penalize misclassifications of the minority class more heavily. This way, the model is encouraged to pay more attention to the underrepresented classes.
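Both ideas can be sketched in a few lines of plain Python. The data below is a toy labeled set invented for illustration, and the weighting formula is the common inverse-frequency scheme, which is one choice among several.

```python
import random
from collections import Counter


def class_weights(labels: list[str]) -> dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}


def oversample(examples: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Duplicate random minority examples until every class matches the largest."""
    rng = random.Random(seed)
    by_class: dict[str, list[tuple[str, str]]] = {}
    for text, label in examples:
        by_class.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


data = [("great", "pos"), ("nice", "pos"), ("good", "pos"), ("bad", "neg")]
print(class_weights([label for _, label in data]))
print(Counter(label for _, label in oversample(data)))
```

The minority class gets a weight of 2.0 against roughly 0.67 for the majority, and after oversampling both classes have three examples. Keep in mind that pure duplication adds no new information, so it pairs well with the regularization tips above.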

Fine-Tuning on Limited Data:

When you have a small dataset, it’s challenging to train a model from scratch without overfitting.

Data Augmentation: Generate more training examples by transforming existing ones. For text, that means paraphrasing, back-translation, or synonym replacement rather than image-style transforms like cropping or rotating.
Transfer Learning with LoRA: Leverage a pre-trained model that has already learned useful features from a large dataset. With LoRA, you only need to fine-tune a few parameters on your specific task, which can significantly improve performance even with limited data.

Advanced Considerations

For those looking to push the boundaries further:

Layer-Specific Fine-Tuning: Focus on fine-tuning specific layers more aggressively, such as attention or feed-forward layers.
Transfer Learning: Apply the model to different tasks by adjusting only the final layers.
Integration with Other Models: Enhance TinyLLaMA by combining it with other models or techniques like retrieval-augmented generation (RAG).

Conclusion

In this tutorial, we’ve explored the powerful techniques for fine-tuning TinyLLaMA using Unsloth while emphasizing efficiency and resource management. Fine-tuning on a Google Colab T4 GPU took approximately 6 minutes, utilizing around 4GB of GPU RAM, with max steps set to 100 (not a full epoch). Running the dataset for at least one epoch is recommended for more significant changes.

We used the tatsu-lab/alpaca dataset and walked through creating and loading custom datasets. Tracking training metrics with Weights & Biases (W&B) was essential for monitoring model performance and GPU utilization in real time, offering valuable insights for optimization.

We also saw that freeing GPU memory after saving the fine-tuned model, whether locally or to the Hub, leaves room for whatever you run next.

With these skills, you can now confidently tackle various fine-tuning tasks, ensuring your models are both effective and resource-efficient. Happy modeling!