
Comprehensive Guide to Preparing Data for Fine-Tuning Models with Unsloth

Step-by-Step Instructions for Beginners


Key Takeaways

  • Understand Model Requirements: Each model has specific data formats and structures necessary for effective fine-tuning.
  • Data Collection and Cleaning: Gathering high-quality, relevant data and meticulously cleaning it ensures better model performance.
  • Structured Formatting and Validation: Properly formatting your data and validating it before fine-tuning prevents errors and enhances training efficiency.

Introduction

Fine-tuning large language models (LLMs) such as Qwen2.5, LLaMA, or thinking (reasoning) models using Unsloth requires careful preparation and structuring of your dataset. This guide provides a comprehensive, step-by-step approach suitable for beginners, complete with code examples and practical tips to ensure your data aligns with the model's requirements.

Step 1: Understanding Model Requirements

1.1. Identify the Target Model's Input Format

Different models require specific data formats. Familiarize yourself with the input structure expected by your target model; a sketch for inspecting it programmatically follows these descriptions:

Qwen2.5

Uses an instruction-following format with prompts and responses separated by designated tokens.

LLaMA

Employs a prompt format with special tokens, such as [INST] ... [/INST] in Llama 2 chat models, that mark the different parts of the input.

Thinking Models

Often require structured data with clear demarcations between system instructions, user prompts, and assistant responses.
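
One practical way to see the exact format a model expects is to load its tokenizer and render a sample conversation through its built-in chat template. The sketch below assumes the "Qwen/Qwen2.5-7B-Instruct" checkpoint; substitute the model you actually plan to fine-tune.

from transformers import AutoTokenizer

# "Qwen/Qwen2.5-7B-Instruct" is an assumed checkpoint; use your target model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Render a sample conversation through the model's own chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

The printed string shows exactly which special tokens and markers your training data should reproduce.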

1.2. Refer to Official Documentation

Consult the official documentation of the model and Unsloth to understand specific formatting guidelines and any unique requirements.


Step 2: Data Collection

2.1. Downloading from Hugging Face

Hugging Face hosts thousands of datasets suitable for various fine-tuning tasks. Select datasets relevant to your application's domain, for instance (a loading example follows this list):

  • Conversational AI: "Anthropic/hh-rlhf"
  • Sentiment Analysis: "sentiment140"
  • Question Answering: "SQuAD"
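
Any of these can be pulled down in a couple of lines with the datasets library. The split name below is an assumption and varies per dataset.

from datasets import load_dataset

# Download a dataset from the Hugging Face Hub (name and split vary per dataset)
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

print(dataset)     # column names and number of rows
print(dataset[0])  # inspect the first example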

2.2. Creating Custom Datasets

If existing datasets don't meet your needs, you can curate your own by extracting data from reliable sources or distilling content from other language models. Ensure that your custom data aligns with the intended use case of the fine-tuned model.

Example: Creating a Custom Instruction-Response Dataset

Suppose you're building a customer support assistant. Your dataset might include:

Instruction: How can I reset my password?
Response: You can reset your password by clicking on 'Forgot Password' on the login page and following the instructions sent to your email.

Instruction: What is your refund policy?
Response: Our refund policy allows returns within 30 days of purchase with a valid receipt. Please visit our refund page for more details.
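
A simple way to capture such pairs is to write them to a JSON Lines file, one example per line; the file name below is just an example.

import json

examples = [
    {"instruction": "How can I reset my password?",
     "response": "You can reset your password by clicking on 'Forgot Password' on the login page and following the instructions sent to your email."},
    {"instruction": "What is your refund policy?",
     "response": "Our refund policy allows returns within 30 days of purchase with a valid receipt. Please visit our refund page for more details."},
]

# Write one JSON object per line (JSONL), a format most data tooling accepts
with open("support_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")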

Step 3: Data Cleaning

3.1. Removing Duplicates

Ensure your dataset doesn't contain duplicate entries, which can bias the model and reduce training efficiency.

3.2. Eliminating Irrelevant or Low-Quality Data

Filter out any data that doesn't contribute meaningfully to the model's training objectives. This includes removing spam, incomplete entries, or content that doesn't align with your use case.

3.3. Ensuring Consistent Formatting

Maintain a uniform structure throughout your dataset. Consistency in formatting helps the model better understand and learn from the data.
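
The sketch below shows what these three cleaning passes might look like with pandas, assuming an instruction/response dataset like the one created above; the minimum-length threshold is an arbitrary example value.

import pandas as pd

df = pd.read_json("support_dataset.jsonl", lines=True)

# 3.1 Remove exact duplicates
df = df.drop_duplicates(subset=["instruction", "response"])

# 3.2 Drop incomplete or very short entries (threshold is an example value)
df = df.dropna(subset=["instruction", "response"])
df = df[df["response"].str.len() > 20]

# 3.3 Normalize whitespace for consistent formatting
df["instruction"] = df["instruction"].str.strip()
df["response"] = df["response"].str.strip()

print(f"{len(df)} examples remain after cleaning")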


Step 4: Data Formatting

4.1. Structuring Data for the Target Model

Format your data according to the requirements of the model you intend to fine-tune. Below are examples for different models:

For Qwen2.5

### Instruction:
What is the capital of France?

### Response:
The capital of France is Paris.

For LLaMA

<s>[INST] What is the capital of France? [/INST] The capital of France is Paris. </s>

For Thinking Models

{
  "text": "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.\n\n<|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?\n\n<|start_header_id|>assistant<|end_header_id|>\n\nThe capital of France is Paris."
}

4.2. Utilizing Templates

Adopt templates that encapsulate the required structure. Templates ensure that your data adheres to the necessary format, making the fine-tuning process smoother.
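
Note that the exact special tokens differ between model families, so if your target model ships a chat template with its tokenizer, you can let the tokenizer render each pair instead of hand-writing the markers shown above. The checkpoint name here is an assumption.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed checkpoint

def to_training_text(instruction, response):
    # Let the tokenizer's built-in chat template insert the correct special tokens
    messages = [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

print(to_training_text("What is the capital of France?", "The capital of France is Paris."))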

4.3. Example Code for Data Formatting

import pandas as pd

# Sample data
data = [
    {"instruction": "What is the capital of France?", "response": "The capital of France is Paris."},
    {"instruction": "What is the largest planet in our solar system?", "response": "The largest planet in our solar system is Jupiter."}
]

# Create a DataFrame
df = pd.DataFrame(data)

# Format data for Qwen2.5 (Alpaca-style instruction format)
df['formatted'] = df.apply(lambda row: f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['response']}", axis=1)

# Save to CSV
df[['formatted']].to_csv('formatted_data_qwen2.5.csv', index=False)

Step 5: Data Validation

5.1. Consistency Checks

Verify that all entries in your dataset follow the prescribed format. Inconsistent data can lead to training errors and suboptimal model performance.
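
A short validation pass like the one below, written here against the 'formatted' column produced in Step 4, can flag malformed rows before training.

import pandas as pd

df = pd.read_csv("formatted_data_qwen2.5.csv")

# Flag rows that are empty or missing the expected section markers
problems = []
for i, text in enumerate(df["formatted"]):
    if not isinstance(text, str) or not text.strip():
        problems.append((i, "empty entry"))
    elif "### Instruction:" not in text or "### Response:" not in text:
        problems.append((i, "missing section marker"))

print(f"{len(problems)} problematic rows found")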

5.2. Token Length Verification

Ensure that your input sequences don't exceed the maximum sequence length you configure for training (Unsloth notebooks commonly use 2048 tokens, though many recent models support longer contexts). Implement padding or truncation as necessary.
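
To check lengths against your chosen limit, tokenize each entry and compare; the checkpoint name and the 2048-token limit below are assumptions matching the earlier examples.

from transformers import AutoTokenizer
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed checkpoint
max_len = 2048

df = pd.read_csv("formatted_data_qwen2.5.csv")
lengths = [len(tokenizer.encode(text)) for text in df["formatted"]]

too_long = sum(length > max_len for length in lengths)
print(f"Longest entry: {max(lengths)} tokens; {too_long} entries exceed {max_len} tokens")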

5.3. Quality Assurance

Manually review a subset of your dataset to ensure high quality and relevance. This step helps catch any overlooked issues during automated cleaning and formatting.


Step 6: Structuring Data for Unsloth

6.1. Organizing Data into CSV or JSON

Unsloth's notebooks typically work with CSV or JSON formats. Structure your data accordingly:

Example CSV Structure

instruction,response
"What is the capital of France?","The capital of France is Paris."
"What is the largest planet in our solar system?","The largest planet in our solar system is Jupiter."

6.2. Loading Data into Unsloth

Upload your formatted CSV or JSON file to the environment where Unsloth is running (e.g., Google Colab, Kaggle).
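
On Google Colab, for example, you can upload the file interactively; on Kaggle you would typically attach it as a dataset instead.

# Google Colab only: opens a file picker and copies the file into the working directory
from google.colab import files
uploaded = files.upload()  # select formatted_data_qwen2.5.csv in the dialog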

6.3. Example Code for Loading Data

import pandas as pd

# Load the CSV file
df = pd.read_csv('formatted_data_qwen2.5.csv')

# Display the first few entries
print(df.head())
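
Unsloth's training code (via TRL's SFTTrainer, shown in the next step) typically expects a Hugging Face Dataset rather than a raw DataFrame, so a common next step is a one-line conversion:

from datasets import Dataset

# Convert the DataFrame into a Hugging Face Dataset for the trainer
dataset = Dataset.from_pandas(df)
print(dataset)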

Step 7: Using Unsloth's Notebooks

7.1. Uploading Your Data

Ensure your data file is accessible within the notebook environment. Use platform-specific commands to upload files if necessary.

7.2. Running the Notebook

Follow the instructions provided in Unsloth's notebook to load, preprocess, and fine-tune your model. The notebook will guide you through setting up your environment, configuring parameters, and executing the fine-tuning process.

7.3. Example Workflow in Unsloth's Notebook

The exact cells differ between notebooks, but the overall pattern mirrors Unsloth's example notebooks: load a model with FastLanguageModel, attach LoRA adapters, wrap your data in a Hugging Face Dataset, and train with TRL's SFTTrainer. The sketch below uses an assumed checkpoint name, example LoRA settings, and the file name from earlier steps; argument names can shift slightly between trl versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import pandas as pd

# Load the base model and tokenizer (4-bit to fit on free-tier GPUs)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B",  # assumed checkpoint; pick the model you intend to fine-tune
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Wrap the formatted CSV in a Hugging Face Dataset
dataset = Dataset.from_pandas(pd.read_csv('formatted_data_qwen2.5.csv'))

# Fine-tune on the 'formatted' text column
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="formatted", max_seq_length=2048,
    args=TrainingArguments(output_dir='./fine_tuned_model', num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()

Step 8: Fine-Tuning Your Model

8.1. Configuring Fine-Tuning Parameters

Set appropriate training parameters such as learning rate, batch size, and number of epochs. These settings can significantly impact the training outcome.

8.2. Monitoring Training Progress

Keep an eye on training metrics like loss and accuracy to ensure the model is learning effectively. Adjust parameters if necessary based on the observed performance.

8.3. Example Code for Fine-Tuning with the Hugging Face Trainer API

As an alternative to Unsloth's notebook flow, the Hugging Face Trainer can also be used directly. The sketch below assumes a classification-style task (such as the sentiment analysis dataset mentioned in Step 2) and tokenized train_dataset and eval_dataset objects prepared beforehand; the checkpoint name is a placeholder for the model you intend to fine-tune.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a pre-trained checkpoint with a fresh two-class classification head
# ("Qwen/Qwen2.5-0.5B" is an assumed placeholder; substitute your chosen model)
model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # decoder-only models need an explicit pad token

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy='epoch',
    save_strategy='epoch',          # must match evaluation_strategy for load_best_model_at_end
    learning_rate=5e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
)

# Initialize Trainer with tokenized datasets prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,    # assumed: tokenized datasets with a 'labels' column
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()

Step 9: Evaluating and Iterating

9.1. Evaluating Model Performance

After fine-tuning, assess your model's performance using validation and test datasets. Metrics such as accuracy, precision, recall, and F1-score provide insights into how well the model performs.
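
For a classification-style fine-tune such as the Trainer example in Step 8, scikit-learn computes these metrics directly; the prediction and label arrays below are placeholders for your model's outputs on a held-out test set.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder arrays; in practice these come from running the model on a test set
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={accuracy_score(y_true, y_pred):.2f}, "
      f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")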

9.2. Refining Your Dataset

If the model's performance isn't satisfactory, revisit your dataset. Consider adding more diverse examples, cleaning data further, or adjusting the data formatting for better alignment with the model's requirements.

9.3. Iterative Improvement

Fine-tuning is often an iterative process. Continuously refine your data and training parameters based on evaluation results to enhance model performance.


Conclusion

Preparing and structuring data for fine-tuning large language models with Unsloth is a critical process that demands careful attention to detail. By understanding the model's requirements, meticulously collecting and cleaning data, and following structured formatting guidelines, you can set a solid foundation for successful fine-tuning. Leveraging Unsloth's tools and resources further streamlines the process, enabling even beginners to effectively fine-tune models like Qwen2.5, LLaMA, or thinking (reasoning) models. Remember, iterative evaluation and refinement are key to achieving optimal model performance.


Last updated February 11, 2025