Before embarking on the fine-tuning process, it's essential to understand the specific requirements of the model you intend to fine-tune. Determine whether your model is designed for tasks such as text classification, question-answering, or text generation, as this will significantly influence how you structure your data.
Different models expect data in different formats. For instance, instruction-tuned models typically expect records with "instruction", "input", and "output" fields, while chat models expect a list of conversation turns with roles and messages.
Review the model's official documentation to understand the exact data structures and formats expected.
Depending on your project, you can either download datasets from repositories like Hugging Face or curate your own data by distilling information from other large language models (LLMs).
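If you go the download route, a minimal sketch using the Hugging Face `datasets` library looks like the snippet below; the dataset identifier is a placeholder, so substitute the repository you actually want.
from datasets import load_dataset

# "some-org/instruction-dataset" is a placeholder; replace it with a real Hub repository ID
raw_dataset = load_dataset("some-org/instruction-dataset", split="train")

# Inspect the first record to see which fields the dataset provides
print(raw_dataset[0])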
Ensure that your data is free from unnecessary characters, inconsistent formatting, and duplicate entries. Data cleaning is crucial to prevent introducing errors during the fine-tuning process.
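As a rough sketch of what such cleaning can look like (the sample records below are illustrative), you can normalize whitespace and drop exact duplicates before formatting:
# Hypothetical list of raw text records gathered from your sources
raw_records = [
    "  Quantum computing is a type of computing...  ",
    "Quantum computing is a type of computing...",
    "Machine learning is a subset of AI...\n",
]

# Normalize whitespace, then drop duplicates while preserving order
cleaned = []
seen = set()
for text in raw_records:
    normalized = " ".join(text.split())
    if normalized and normalized not in seen:
        seen.add(normalized)
        cleaned.append(normalized)

print(cleaned)  # two unique, trimmed records remain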
Based on the model requirements identified in Step 1, structure your data into appropriate fields. Common choices are "instruction", "input", and "output" for instruction-tuning models, or a "conversations" list of role/message pairs for chat-style models, as the two examples below show.
{
  "instruction": "Explain quantum computing",
  "input": "Give a beginner-friendly explanation",
  "output": "Quantum computing is a type of computing..."
}
{
  "conversations": [
    {
      "from": "human",
      "value": "What is machine learning?"
    },
    {
      "from": "assistant",
      "value": "Machine learning is a subset of AI..."
    }
  ]
}
The choice of file format depends on the model and the fine-tuning tool you're using. Common formats include JSONL, CSV, and plain text. For example, an instruction dataset in JSONL and a text-classification dataset in CSV might look like the snippets below.
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Who wrote the play Hamlet?", "input": "", "output": "The play Hamlet was written by William Shakespeare."}
text,label
"This is a positive review.",1
"This is a negative review.",0
Automate the data formatting process using scripts to ensure consistency and efficiency.
import csv
import json

# Define the input CSV file and output JSONL file names
input_csv = "raw_data.csv"
output_jsonl = "formatted_data.jsonl"

# Read the CSV file and write a JSONL-formatted file
with open(input_csv, mode="r", encoding="utf-8") as csvfile, open(output_jsonl, mode="w", encoding="utf-8") as jsonlfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Suppose your CSV has columns: instruction, input, output
        data_object = {
            "instruction": row.get("instruction", ""),
            "input": row.get("input", ""),
            "output": row.get("output", "")
        }
        jsonlfile.write(json.dumps(data_object) + "\n")

print(f"Data has been written to {output_jsonl}")
Open a few entries from your formatted data files to ensure that the structure and content align with the model's requirements.
import json

with open("formatted_data.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        data = json.loads(line)
        print(f"Example {i + 1}: {data}")
        if i >= 4:  # print first 5 examples
            break
Ensure that all necessary fields are present and that there are no missing or malformed entries. Consistent formatting across all data points is vital for effective fine-tuning.
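A small validation pass along these lines (a sketch assuming the instruction/input/output schema used above) can catch missing or malformed entries before training:
import json

required_fields = {"instruction", "input", "output"}
problems = []

with open("formatted_data.jsonl", "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"Line {line_number}: not valid JSON")
            continue
        missing = required_fields - record.keys()
        if missing:
            problems.append(f"Line {line_number}: missing fields {sorted(missing)}")
        elif not str(record["instruction"]).strip() or not str(record["output"]).strip():
            problems.append(f"Line {line_number}: empty instruction or output")

print(f"Found {len(problems)} problem(s)")
for problem in problems:
    print(problem)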
Convert textual data into token IDs using a tokenizer compatible with your target model. This step is crucial for preparing data that the model can process.
from transformers import AutoTokenizer
# Load the tokenizer for your model (replace "model_name" with your model's Hub identifier, e.g. "Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("model_name")
# Tokenize your data
inputs = tokenizer("Your input text here", return_tensors="pt")
Ensure that all input sequences are of the same length by padding shorter sequences and truncating longer ones as needed. This uniformity is essential for efficient training.
# Tokenize with padding and truncation
inputs = tokenizer(
    "Your input text here",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
Divide your dataset into training and validation subsets to evaluate the model's performance during fine-tuning. A common split is 80% for training and 20% for validation.
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "validation.jsonl"})
# Alternatively, split the dataset
split_dataset = dataset["train"].train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]
The Hugging Face `datasets` library provides a convenient way to load and manipulate your data for fine-tuning.
from datasets import load_dataset
# Load your dataset from a JSONL file
dataset = load_dataset("json", data_files={"train": "formatted_data.jsonl", "validation": "validation_data.jsonl"})
Ensure that your data is compatible with Unsloth by adhering to its expected input structure and formatting conventions.
from unsloth import FastLanguageModel

# Load the model with appropriate parameters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True
)

# Prepare model for training with PEFT (LoRA adapters)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0
)
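To line your dataset up with this setup, one common approach (a sketch that reuses the tokenizer returned above and the dataset loaded earlier; the Alpaca-style template is illustrative, so adjust it to the prompt format your model expects) is to map each record into a single text field for the trainer:
# A simple prompt template; adapt it to your model's expected chat/prompt format
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(example):
    # Combine the structured fields into one training string and append the EOS token
    text = PROMPT_TEMPLATE.format(
        instruction=example["instruction"],
        input=example["input"],
        output=example["output"],
    )
    return {"text": text + tokenizer.eos_token}

# dataset comes from the load_dataset call shown earlier
formatted_dataset = dataset["train"].map(format_example)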
Ensure that your environment is properly configured with all necessary dependencies and that your data is correctly loaded.
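A quick sanity check along these lines (a sketch; the exact package list depends on your setup) confirms that the key libraries import cleanly and that a GPU is visible:
import importlib

import torch

# Check that the main libraries used in this guide can be imported
for package in ("transformers", "datasets", "unsloth"):
    try:
        module = importlib.import_module(package)
        print(f"{package} {getattr(module, '__version__', 'unknown version')} OK")
    except ImportError:
        print(f"{package} is missing; install it before fine-tuning")

# Confirm that a CUDA-capable GPU is available for training
print("CUDA available:", torch.cuda.is_available())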
Before committing to a full-scale fine-tuning run, perform a small-scale test using a subset of your data to ensure everything is functioning correctly.
python run_finetune.py \
    --model_name_or_path "llama-base" \
    --train_file "formatted_data.jsonl" \
    --output_dir "./llama-finetuned" \
    --per_device_train_batch_size 2 \
    --num_train_epochs 1 \
    --logging_steps 10
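To build that subset, one simple option (a sketch; the subset file name is illustrative) is to copy the first few hundred examples into a separate JSONL file and point --train_file at it:
# Write the first 200 examples to a smaller file for a quick test run
with open("formatted_data.jsonl", "r", encoding="utf-8") as src, \
        open("formatted_data_subset.jsonl", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i >= 200:
            break
        dst.write(line)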
Use logging and monitoring tools to track the training progress. Evaluate the model's performance on the validation set to ensure it is learning effectively.
import torch
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("llama-base")
tokenizer = AutoTokenizer.from_pretrained("llama-base")
# Causal LM tokenizers often lack a pad token; reuse the EOS token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare your dataset
# ...

# Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-5)

# Fine-tuning loop (one example at a time for simplicity)
for epoch in range(5):
    model.train()
    total_loss = 0
    for example in train_dataset:
        # Concatenate prompt and response into a single training text
        text = example["instruction"] + "\n" + example["output"]
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
        optimizer.zero_grad()
        # For causal language modeling, the labels are the input IDs themselves
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_dataset)}")
After fine-tuning, save your model and tokenizer for future use or deployment.
# Save the fine-tuned model and tokenizer
model.save_pretrained("./llama-finetuned")
tokenizer.save_pretrained("./llama-finetuned")
Use the validation set to assess the model's performance. Metrics like perplexity, accuracy, or task-specific evaluations can help determine the effectiveness of the fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the fine-tuned model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./llama-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./llama-finetuned")

# Initialize a generation pipeline
nlp = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Evaluate on a sample input
prompt = "What is the capital of Germany?"
response = nlp(prompt, max_length=50)
print(response)
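If you also want a quantitative signal, a rough perplexity estimate over the validation set can be computed from the model's loss. This is a sketch, assuming the validation_dataset split created earlier with the same instruction/output fields:
import math

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

total_loss = 0.0
count = 0
with torch.no_grad():
    for example in validation_dataset:
        text = example["instruction"] + "\n" + example["output"]
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
        outputs = model(**inputs, labels=inputs["input_ids"])
        total_loss += outputs.loss.item()
        count += 1

# Perplexity is the exponential of the average cross-entropy loss
perplexity = math.exp(total_loss / count)
print(f"Validation perplexity: {perplexity:.2f}")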
Preparing and structuring data for fine-tuning large language models like Qwen2.5, Llama, or thinking models with Unsloth involves several meticulous steps. Understanding the specific requirements of your target model, ensuring data quality through rigorous cleaning and formatting, and validating your data structure are foundational to a successful fine-tuning run. By following this guide and adapting the provided code examples, even those new to the field can prepare their data and fine-tune their models with confidence.