Why Fine-tune at All
Before diving into how, let’s be clear about when fine-tuning actually makes sense.
Fine-tuning is not prompt engineering. If you can solve the problem with a better system prompt and a handful of examples, do that. It’s cheaper, faster to iterate, and easier to maintain.
Fine-tuning makes sense when:
- You need consistent output format across thousands of calls
- Your domain has vocabulary and patterns a general model hasn’t learned well
- Latency requirements demand a smaller model that performs like a larger one
- You’re making millions of API calls and model size economics matter
For the projects I’ve worked on — telecom documentation summarization, enterprise incident triage, healthcare note generation — fine-tuning cleared a real quality bar that prompting alone couldn’t reach.
What is QLoRA
Standard fine-tuning requires loading the full model in float32 or float16, computing gradients through all parameters, and storing optimizer states. For a 7B parameter model, that’s approximately 112 GB of GPU memory at float16 — unusable on anything short of a multi-A100 cluster.
LoRA (Low-Rank Adaptation) sidesteps this by freezing all original weights and training only small rank-decomposition matrices injected into the attention layers. Instead of updating 7B parameters, you update ~4-8M parameters. Memory requirements drop by 10-30x.
QLoRA goes further: it loads the frozen base model in 4-bit precision (NF4 quantization), reducing memory for the base weights by 4x compared to float16. Combined with gradient checkpointing and paged AdamW, a 7B model fine-tune fits on a 24GB consumer GPU (RTX 3090/4090).
Dataset Preparation
The most common mistake I see: treating dataset prep as an afterthought.
Format
For instruction fine-tuning, your data needs to be in prompt/response pairs. The Alpaca format works well:
{
"instruction": "Summarize the following alarm description in one sentence.",
"input": "P-GW 3201 BEARER_TIMEOUT: PDN bearer establishment timeout exceeded for IMSI 234150123456789. No response received from SGW within configured 12s window. Bearer request rejected.",
"output": "Bearer establishment for IMSI 234150123456789 failed due to a 12-second SGW response timeout on the P-GW."
}
Quality Over Quantity
100 high-quality examples beat 10,000 mediocre ones. I’ve successfully fine-tuned on as few as 300 examples for narrow, well-defined tasks.
Quality checklist:
- Output format is consistent across all examples
- No contradictions between examples
- Distribution matches your inference use case
- No data leakage from your validation set into training
Deduplication
Run aggressive deduplication. Near-duplicate examples cause the model to overfit specific phrasings rather than learning the underlying pattern. I use MinHash LSH at Jaccard threshold 0.8 as a baseline.
The Training Setup
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# Load in 4-bit
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=16, # rank — higher = more capacity, more memory
lora_alpha=32, # scaling factor, typically 2x rank
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
Rank Selection
r=8 is a safe starting point. Go to r=16 if your task requires more expressivity (complex formatting, domain-specific terminology, multi-step reasoning). r=32+ rarely helps unless you have thousands of diverse examples.
Learning Rate
Start at 2e-4 with cosine decay. Lower than 1e-4 and you’re barely learning. Higher than 4e-4 and catastrophic forgetting becomes a real risk.
What Actually Breaks
Catastrophic Forgetting
Fine-tuning on a narrow task degrades general capabilities. A model fine-tuned aggressively on technical summarization may start producing worse outputs on anything outside that narrow distribution.
Mitigation: keep your fine-tuning dataset distribution as close to your actual inference distribution as possible. Don’t include formatting examples that are wildly different from your inference prompts.
Overfitting on Small Datasets
Signs: training loss < 0.1 while validation loss stays flat or climbs. The model has memorized your training data, not learned the pattern.
Mitigation: reduce rank (r), increase dropout, and — if you have < 500 examples — consider data augmentation (synonym replacement, paraphrase with GPT-4, back-translation).
Prompt Format Drift
If your inference prompt format differs from your training format, you’ll get degraded outputs. This sounds obvious but bites constantly in production. The fine-tuned model has learned to expect specific tokens in specific positions.
Fix: standardize your prompt format in code before training. Never hardcode it in the dataset. Pull it from the same template function you use for inference.
Evaluating the Fine-tune
Don’t trust loss curves alone. Build a small evaluation harness:
- Held-out test set — 10-20% of your data, never seen during training
- Reference outputs — human-written or GPT-4-generated ideal responses for each test input
- Metric — ROUGE-L for summarization, exact match for classification, LLM-as-judge for open-ended tasks
I use a GPT-4 judge with a structured rubric: accuracy, format compliance, conciseness, hallucinations. Rate each dimension 1-5, average across the test set.
Production Deployment
Once you have a fine-tuned adapter, you have options:
Option 1: Serve the base model + load adapter at inference time. Flexible but adds ~50ms overhead.
Option 2: Merge adapter into base weights. Use merge_and_unload(). Zero overhead at inference, but you lose adapter flexibility.
Option 3: Serve via vLLM or TGI. Both support LoRA adapters natively. This is the right call for anything beyond prototyping.
For cost-sensitive deployments on HuggingFace Spaces or small EC2 instances, I typically merge and quantize with GGUF for llama.cpp serving. A merged, Q4_K_M quantized Mistral 7B runs at ~4.5 GB RAM and ~30 tokens/second on a single CPU core.
The gap between a fine-tuning tutorial and a production fine-tune is mostly dataset quality and evaluation rigor. The training code is almost the same. The judgment calls before and after — that’s where the work is.