Before embarking on the fine-tuning process, it's essential to understand the specific requirements of the model you intend to fine-tune. Determine whether your model is designed for tasks such as text classification, question-answering, or text generation, as this will significantly influence how you structure your data.
Different models expect data in different formats. For instance, instruction-tuned models typically expect records with "instruction", "input", and "output" fields, while chat models expect a list of conversation turns with roles and messages.
Review the model's official documentation to understand the exact data structures and formats expected.
Depending on your project, you can either download datasets from repositories like Hugging Face or curate your own data by distilling information from other large language models (LLMs).
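If you go the download route, a minimal sketch using the Hugging Face `datasets` library looks like the snippet below; the dataset identifier is a placeholder, so substitute the repository you actually want.
from datasets import load_dataset

# "some-org/instruction-dataset" is a placeholder; replace it with a real Hub repository ID
raw_dataset = load_dataset("some-org/instruction-dataset", split="train")

# Inspect the first record to see which fields the dataset provides
print(raw_dataset[0])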
Ensure that your data is free from unnecessary characters, inconsistent formatting, and duplicate entries. Data cleaning is crucial to prevent introducing errors during the fine-tuning process.
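As a rough sketch of what such cleaning can look like (the sample records below are illustrative), you can normalize whitespace and drop exact duplicates before formatting:
# Hypothetical list of raw text records gathered from your sources
raw_records = [
    "  Quantum computing is a type of computing...  ",
    "Quantum computing is a type of computing...",
    "Machine learning is a subset of AI...\n",
]

# Normalize whitespace, then drop duplicates while preserving order
cleaned = []
seen = set()
for text in raw_records:
    normalized = " ".join(text.split())
    if normalized and normalized not in seen:
        seen.add(normalized)
        cleaned.append(normalized)

print(cleaned)  # two unique, trimmed records remain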
Based on the model requirements identified in Step 1, structure your data into appropriate fields. Common choices are "instruction", "input", and "output" for instruction-tuning models, or a "conversations" list of role/message pairs for chat-style models, as the two examples below show.
{
  "instruction": "Explain quantum computing",
  "input": "Give a beginner-friendly explanation",
  "output": "Quantum computing is a type of computing..."
}
{
  "conversations": [
    {
      "from": "human",
      "value": "What is machine learning?"
    },
    {
      "from": "assistant",
      "value": "Machine learning is a subset of AI..."
    }
  ]
}
The choice of file format depends on the model and the fine-tuning tool you're using. Common formats include JSONL, CSV, and plain text. For example, an instruction dataset in JSONL and a text-classification dataset in CSV might look like the snippets below.
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Who wrote the play Hamlet?", "input": "", "output": "The play Hamlet was written by William Shakespeare."}
text,label
"This is a positive review.",1
"This is a negative review.",0
Automate the data formatting process using scripts to ensure consistency and efficiency.
import csv
import json

# Define the input CSV file and output JSONL file names
input_csv = "raw_data.csv"
output_jsonl = "formatted_data.jsonl"

# Read the CSV file and write a JSONL-formatted file
with open(input_csv, mode="r", encoding="utf-8") as csvfile, open(output_jsonl, mode="w", encoding="utf-8") as jsonlfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Suppose your CSV has columns: instruction, input, output
        data_object = {
            "instruction": row.get("instruction", ""),
            "input": row.get("input", ""),
            "output": row.get("output", "")
        }
        jsonlfile.write(json.dumps(data_object) + "\n")

print(f"Data has been written to {output_jsonl}")
Open a few entries from your formatted data files to ensure that the structure and content align with the model's requirements.
import json

with open("formatted_data.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        data = json.loads(line)
        print(f"Example {i + 1}: {data}")
        if i >= 4:  # print first 5 examples
            break
Ensure that all necessary fields are present and that there are no missing or malformed entries. Consistent formatting across all data points is vital for effective fine-tuning.
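A small validation pass along these lines (a sketch assuming the instruction/input/output schema used above) can catch missing or malformed entries before training:
import json

required_fields = {"instruction", "input", "output"}
problems = []

with open("formatted_data.jsonl", "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"Line {line_number}: not valid JSON")
            continue
        missing = required_fields - record.keys()
        if missing:
            problems.append(f"Line {line_number}: missing fields {sorted(missing)}")
        elif not str(record["instruction"]).strip() or not str(record["output"]).strip():
            problems.append(f"Line {line_number}: empty instruction or output")

print(f"Found {len(problems)} problem(s)")
for problem in problems:
    print(problem)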
Convert textual data into token IDs using a tokenizer compatible with your target model. This step is crucial for preparing data that the model can process.
from transformers import AutoTokenizer
# Load the tokenizer for your model (replace "model_name" with your model's Hub identifier, e.g. "Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("model_name")
# Tokenize your data
inputs = tokenizer("Your input text here", return_tensors="pt")
Ensure that all input sequences are of the same length by padding shorter sequences and truncating longer ones as needed. This uniformity is essential for efficient training.
# Tokenize with padding and truncation
inputs = tokenizer(
    "Your input text here",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
Divide your dataset into training and validation subsets to evaluate the model's performance during fine-tuning. A common split is 80% for training and 20% for validation.
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "validation.jsonl"})
# Alternatively, split the dataset
split_dataset = dataset["train"].train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]
The Hugging Face `datasets` library provides a convenient way to load and manipulate your data for fine-tuning.
from datasets import load_dataset
# Load your dataset from a JSONL file
dataset = load_dataset("json", data_files={"train": "formatted_data.jsonl", "validation": "validation_data.jsonl"})
Ensure that your data is compatible with Unsloth by adhering to its expected input structure and formatting conventions.
from unsloth import FastLanguageModel

# Load the model with appropriate parameters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True
)

# Prepare model for training with PEFT (LoRA adapters)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0
)
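To line your dataset up with this setup, one common approach (a sketch that reuses the tokenizer returned above and the dataset loaded earlier; the Alpaca-style template is illustrative, so adjust it to the prompt format your model expects) is to map each record into a single text field for the trainer:
# A simple prompt template; adapt it to your model's expected chat/prompt format
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(example):
    # Combine the structured fields into one training string and append the EOS token
    text = PROMPT_TEMPLATE.format(
        instruction=example["instruction"],
        input=example["input"],
        output=example["output"],
    )
    return {"text": text + tokenizer.eos_token}

# dataset comes from the load_dataset call shown earlier
formatted_dataset = dataset["train"].map(format_example)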
Ensure that your environment is properly configured with all necessary dependencies and that your data is correctly loaded.
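A quick sanity check along these lines (a sketch; the exact package list depends on your setup) confirms that the key libraries import cleanly and that a GPU is visible:
import importlib

import torch

# Check that the main libraries used in this guide can be imported
for package in ("transformers", "datasets", "unsloth"):
    try:
        module = importlib.import_module(package)
        print(f"{package} {getattr(module, '__version__', 'unknown version')} OK")
    except ImportError:
        print(f"{package} is missing; install it before fine-tuning")

# Confirm that a CUDA-capable GPU is available for training
print("CUDA available:", torch.cuda.is_available())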
Before committing to a full-scale fine-tuning run, perform a small-scale test using a subset of your data to ensure everything is functioning correctly.
python run_finetune.py \
    --model_name_or_path "llama-base" \
    --train_file "formatted_data.jsonl" \
    --output_dir "./llama-finetuned" \
    --per_device_train_batch_size 2 \
    --num_train_epochs 1 \
    --logging_steps 10
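To build that subset, one simple option (a sketch; the subset file name is illustrative) is to copy the first few hundred examples into a separate JSONL file and point --train_file at it:
# Write the first 200 examples to a smaller file for a quick test run
with open("formatted_data.jsonl", "r", encoding="utf-8") as src, \
        open("formatted_data_subset.jsonl", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i >= 200:
            break
        dst.write(line)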
Use logging and monitoring tools to track the training progress. Evaluate the model's performance on the validation set to ensure it is learning effectively.
import torch
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")
# Causal LM tokenizers often lack a pad token; reuse the EOS token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare your dataset
# ...

# Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-5)

# Fine-tuning loop (one example at a time for simplicity)
for epoch in range(5):
    model.train()
    total_loss = 0
    for example in train_dataset:
        # Concatenate prompt and response into a single training text
        text = example["instruction"] + "\n" + example["output"]
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        optimizer.zero_grad()
        # For causal language modeling, the labels are the input IDs themselves
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_dataset)}")
After fine-tuning, save your model and tokenizer for future use or deployment.
# Save the fine-tuned model and tokenizer
model.save_pretrained("./llama-finetuned")
tokenizer.save_pretrained("./llama-finetuned")
Use the validation set to assess the model's performance. Metrics like perplexity, accuracy, or task-specific evaluations can help determine the effectiveness of the fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the fine-tuned model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./llama-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./llama-finetuned")

# Initialize a generation pipeline
nlp = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Evaluate on a sample input
prompt = "What is the capital of Germany?"
response = nlp(prompt, max_length=50)
print(response)
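If you also want a quantitative signal, a rough perplexity estimate over the validation set can be computed from the model's loss. This is a sketch, assuming the validation_dataset split created earlier with the same instruction/output fields:
import math

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

total_loss = 0.0
count = 0
with torch.no_grad():
    for example in validation_dataset:
        text = example["instruction"] + "\n" + example["output"]
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
        outputs = model(**inputs, labels=inputs["input_ids"])
        total_loss += outputs.loss.item()
        count += 1

# Perplexity is the exponential of the average cross-entropy loss
perplexity = math.exp(total_loss / count)
print(f"Validation perplexity: {perplexity:.2f}")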
Preparing and structuring data for fine-tuning large language models like Qwen2.5, Llama, or thinking models with Unsloth involves several meticulous steps. Understanding the specific requirements of your target model, ensuring data quality through rigorous cleaning and formatting, and validating your data structure are foundational to a successful fine-tuning run. By following this guide and adapting the provided code examples, even those new to the field can prepare their data and fine-tune their models with confidence.