Why Fine-tune at All

Before diving into how, let’s be clear about when fine-tuning actually makes sense.

Fine-tuning is not prompt engineering. If you can solve the problem with a better system prompt and a handful of examples, do that. It’s cheaper, faster to iterate, and easier to maintain.

Fine-tuning makes sense when:

  • You need consistent output format across thousands of calls
  • Your domain has vocabulary and patterns a general model hasn’t learned well
  • Latency requirements demand a smaller model that performs like a larger one
  • You’re making millions of API calls and model size economics matter

For the projects I’ve worked on — telecom documentation summarization, enterprise incident triage, healthcare note generation — fine-tuning cleared a real quality bar that prompting alone couldn’t reach.

What is QLoRA

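Standard fine-tuning requires loading the full model in float32 or float16, computing gradients through all parameters, and storing optimizer states. For a 7B parameter model trained in mixed precision with Adam, that's roughly 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two Adam moment estimates), or approximately 112 GB of GPU memory before activations. That is out of reach on anything short of a multi-GPU A100 node.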

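LoRA (Low-Rank Adaptation) sidesteps this by freezing all original weights and training only small rank-decomposition matrices injected into the attention layers. Instead of updating 7B parameters, you update a few million (roughly 7M at r=8 and 14M at r=16 when targeting all four attention projections of Mistral 7B). Gradient and optimizer-state memory shrink accordingly, though the frozen base weights still have to fit on the GPU.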
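
To make the trainable-parameter count concrete, here is a minimal sketch that counts LoRA parameters for one set of layer shapes. The shapes are my assumption of a Mistral-7B-style attention block (hidden size 4096, 32 layers, 1024-dimensional key/value projections); other models will give different totals.

```python
# Count trainable LoRA parameters for one set of assumed layer shapes.
# Assumption: Mistral-7B-style attention (hidden size 4096, 32 layers,
# grouped-query attention with 1024-dim key/value projections).
HIDDEN = 4096
KV_DIM = 1024
N_LAYERS = 32

# (in_features, out_features) for each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * N_LAYERS


print(lora_param_count(8))   # 6815744
print(lora_param_count(16))  # 13631488
```

Doubling the rank doubles the adapter size, which is why rank is the main knob for trading capacity against memory.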

QLoRA goes further: it loads the frozen base model in 4-bit precision (NF4 quantization), reducing memory for the base weights by 4x compared to float16. Combined with gradient checkpointing and paged AdamW, a 7B model fine-tune fits on a 24GB consumer GPU (RTX 3090/4090).
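
The memory figures in this section can be sanity-checked with quick arithmetic. This is a back-of-the-envelope sketch, assuming mixed-precision training with Adam for the full fine-tune and ignoring activations and quantization constants; real usage will be somewhat higher.

```python
# Back-of-the-envelope memory estimates for a 7B parameter model.
# Assumption: mixed-precision training with Adam; activations are ignored.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_gb(n_params: float) -> float:
    """Estimated training-state memory in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def weights_only_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone at the given bit width."""
    return n_params * bits / 8 / 1e9


print(full_finetune_gb(7e9))     # 112.0  (full fine-tune)
print(weights_only_gb(7e9, 16))  # 14.0   (frozen float16 base, as in LoRA)
print(weights_only_gb(7e9, 4))   # 3.5    (frozen 4-bit base, as in QLoRA)
```

The 4x drop from 14 GB to 3.5 GB is what leaves room for activations, adapters, and optimizer state on a 24GB card.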

Dataset Preparation

The most common mistake I see: treating dataset prep as an afterthought.

Format

For instruction fine-tuning, your data needs to be in prompt/response pairs. The Alpaca format works well:

{
  "instruction": "Summarize the following alarm description in one sentence.",
  "input": "P-GW 3201 BEARER_TIMEOUT: PDN bearer establishment timeout exceeded for IMSI 234150123456789. No response received from SGW within configured 12s window. Bearer request rejected.",
  "output": "Bearer establishment for IMSI 234150123456789 failed due to a 12-second SGW response timeout on the P-GW."
}

Quality Over Quantity

100 high-quality examples beat 10,000 mediocre ones. I’ve successfully fine-tuned on as few as 300 examples for narrow, well-defined tasks.

Quality checklist:

  • Output format is consistent across all examples
  • No contradictions between examples
  • Distribution matches your inference use case
  • No data leakage from your validation set into training
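
Two items on that checklist are easy to automate. The sketch below assumes Alpaca-style records like the one shown earlier; the function names are hypothetical, and matching on exact prompt text is a simplification that will miss paraphrased leaks.

```python
from collections import defaultdict


def _key(example: dict) -> tuple:
    # Two examples ask "the same question" if instruction and input match
    return (example["instruction"].strip(), example.get("input", "").strip())


def find_contradictions(examples: list[dict]) -> list[tuple]:
    """Return prompts that appear with more than one distinct output."""
    outputs = defaultdict(set)
    for ex in examples:
        outputs[_key(ex)].add(ex["output"].strip())
    return [key for key, outs in outputs.items() if len(outs) > 1]


def find_leakage(train: list[dict], val: list[dict]) -> list[tuple]:
    """Return validation prompts that also appear in the training set."""
    train_keys = {_key(ex) for ex in train}
    return [_key(ex) for ex in val if _key(ex) in train_keys]


train = [
    {"instruction": "Summarize.", "input": "alarm A", "output": "A failed."},
    {"instruction": "Summarize.", "input": "alarm A", "output": "A timed out."},
    {"instruction": "Summarize.", "input": "alarm B", "output": "B failed."},
]
val = [{"instruction": "Summarize.", "input": "alarm B", "output": "B failed."}]

print(find_contradictions(train))  # [('Summarize.', 'alarm A')]
print(find_leakage(train, val))    # [('Summarize.', 'alarm B')]
```

I would run both checks before every training run, since leakage tends to creep back in whenever the dataset is regenerated.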

Deduplication

Run aggressive deduplication. Near-duplicate examples cause the model to overfit specific phrasings rather than learning the underlying pattern. I use MinHash LSH at Jaccard threshold 0.8 as a baseline.
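
To show what the 0.8 threshold means, here is a small sketch that computes exact Jaccard similarity over word 3-grams. This is not MinHash LSH: it compares every pair directly, which is the quantity MinHash approximates, and it is only practical for small datasets. The shingle size of 3 is my assumption.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first text of each near-duplicate group (compares all pairs)."""
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


a = "a b c d e f g h i j k l"
b = "a b c d e f g h i j k l m"   # one extra word: 10 of 11 shingles shared
c = "x y z w"
print(dedupe([a, b, c]))  # ['a b c d e f g h i j k l', 'x y z w']
```

For tens of thousands of examples the pairwise loop becomes too slow, which is the point at which MinHash LSH earns its place.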

The Training Setup

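from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit NF4 settings belong in a BitsAndBytesConfig, not as loose
# keyword arguments to from_pretrained
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the frozen base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)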

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,              # rank — higher = more capacity, more memory
    lora_alpha=32,     # scaling factor, typically 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

Rank Selection

r=8 is a safe starting point. Go to r=16 if your task requires more expressivity (complex formatting, domain-specific terminology, multi-step reasoning). r=32+ rarely helps unless you have thousands of diverse examples.

Learning Rate

Start at 2e-4 with cosine decay. Lower than 1e-4 and you’re barely learning. Higher than 4e-4 and catastrophic forgetting becomes a real risk.
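
For reference, cosine decay from a 2e-4 peak can be sketched in a few lines. This is a simplified version without warmup, and the function name and `min_lr` parameter are my own; the scheduler in your training library may differ in details.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4,
              min_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from peak_lr to min_lr."""
    # progress runs from 0.0 at the first step to 1.0 at the last
    progress = min(1.0, max(0.0, step / max(1, total_steps)))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The rate stays near the peak early on, passes through half the peak at the midpoint, and flattens out toward zero at the end.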

What Actually Breaks

Catastrophic Forgetting

Fine-tuning on a narrow task degrades general capabilities. A model fine-tuned aggressively on technical summarization may start producing worse outputs on anything outside that narrow distribution.

Mitigation: keep your fine-tuning dataset distribution as close to your actual inference distribution as possible. Don’t include formatting examples that are wildly different from your inference prompts.

Overfitting on Small Datasets

Signs: training loss < 0.1 while validation loss stays flat or climbs. The model has memorized your training data, not learned the pattern.

Mitigation: reduce rank (r), increase dropout, and — if you have < 500 examples — consider data augmentation (synonym replacement, paraphrase with GPT-4, back-translation).
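
The overfitting signature described above is simple to check from logged losses. A minimal sketch, assuming you record one training loss and one validation loss per evaluation; the function names and the `patience` window of 3 are hypothetical choices.

```python
def val_loss_stalled(val_losses: list[float], patience: int = 3) -> bool:
    """True if the last `patience` evaluations did not beat the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def looks_overfit(train_losses: list[float], val_losses: list[float],
                  train_floor: float = 0.1, patience: int = 3) -> bool:
    """True if training loss is under the floor while validation has stalled."""
    if not train_losses:
        return False
    return train_losses[-1] < train_floor and val_loss_stalled(val_losses, patience)


train = [1.20, 0.60, 0.30, 0.12, 0.08, 0.05]
val = [1.00, 0.80, 0.70, 0.71, 0.72, 0.75]
print(looks_overfit(train, val))  # True
```

When this fires, the checkpoint from the evaluation with the lowest validation loss is usually the one worth keeping.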

Prompt Format Drift

If your inference prompt format differs from your training format, you’ll get degraded outputs. This sounds obvious but bites constantly in production. The fine-tuned model has learned to expect specific tokens in specific positions.

Fix: standardize your prompt format in code before training. Never hardcode it in the dataset. Pull it from the same template function you use for inference.
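
One way to enforce this is a single template function that both the training pipeline and the inference service import. The template text, function names, and the `</s>` end-of-sequence string below are illustrative assumptions; use whatever format and tokenizer EOS token your model actually expects.

```python
# Hypothetical Alpaca-style template shared by training and inference
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Render the prompt exactly as the model sees it at inference time."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(), input=input_text.strip()
    )


def build_training_text(example: dict, eos_token: str = "</s>") -> str:
    """Training text is the inference prompt plus the target output and EOS."""
    prompt = build_prompt(example["instruction"], example.get("input", ""))
    return prompt + example["output"].strip() + eos_token


example = {
    "instruction": "Summarize the following alarm description in one sentence.",
    "input": "P-GW 3201 BEARER_TIMEOUT",
    "output": "Bearer setup timed out on the P-GW.",
}
prompt = build_prompt(example["instruction"], example["input"])
text = build_training_text(example)
print(text.startswith(prompt))  # True
```

Because the training text is built from `build_prompt`, any change to the template applies to both sides at once.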

Evaluating the Fine-tune

Don’t trust loss curves alone. Build a small evaluation harness:

  1. Held-out test set — 10-20% of your data, never seen during training
  2. Reference outputs — human-written or GPT-4-generated ideal responses for each test input
  3. Metric — ROUGE-L for summarization, exact match for classification, LLM-as-judge for open-ended tasks
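
As a sketch of the metric step, here is a simplified ROUGE-L F1 and an exact-match check in plain Python. It tokenizes on whitespace and skips stemming, so scores will not match the standard ROUGE packages exactly; treat it as an illustration of the longest-common-subsequence idea rather than a reference implementation.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over lowercase word tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive equality after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833
```

Averaging `rouge_l_f1` over the held-out set gives one number to compare between checkpoints.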

I use a GPT-4 judge with a structured rubric: accuracy, format compliance, conciseness, hallucinations. Rate each dimension 1-5, average across the test set.
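
Aggregating the judge's ratings can be sketched as follows. The dictionary keys and function names are hypothetical, and I am assuming the judge returns integers where 5 is best on every dimension (so a 5 on hallucinations means none were found).

```python
from statistics import mean

# Hypothetical rubric keys, one per dimension named above
DIMENSIONS = ("accuracy", "format_compliance", "conciseness", "hallucinations")


def validate_scores(scores: dict) -> dict:
    """Check that one judged example has a 1-5 integer for every dimension."""
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim}: expected an integer 1-5, got {value!r}")
    return scores


def aggregate(all_scores: list[dict]) -> dict:
    """Average each rubric dimension across the test set."""
    checked = [validate_scores(s) for s in all_scores]
    return {dim: mean(s[dim] for s in checked) for dim in DIMENSIONS}


judged = [
    {"accuracy": 5, "format_compliance": 4, "conciseness": 3, "hallucinations": 5},
    {"accuracy": 3, "format_compliance": 4, "conciseness": 5, "hallucinations": 4},
]
print(aggregate(judged))
```

Validating before averaging matters because judge models occasionally return out-of-range or missing values, and those should fail loudly instead of skewing the mean.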

Production Deployment

Once you have a fine-tuned adapter, you have options:

Option 1: Serve the base model + load adapter at inference time. Flexible but adds ~50ms overhead.

Option 2: Merge adapter into base weights. Use merge_and_unload(). Zero overhead at inference, but you lose adapter flexibility.
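
What merging does to a single weight matrix can be shown with toy numbers. This sketch uses plain Python lists and tiny made-up matrices; it illustrates the arithmetic W + (alpha / r) * B @ A only, not the library's actual implementation.

```python
def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: list, lora_a: list, lora_b: list, alpha: float, rank: int) -> list:
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: 2x2 base weight, rank-1 adapter, alpha = 2
w = [[1, 2], [3, 4]]
lora_a = [[1, 2]]        # shape (rank, in_features)
lora_b = [[1], [3]]      # shape (out_features, rank)
x = [[1], [1]]           # input column vector

merged = merge_lora(w, lora_a, lora_b, alpha=2, rank=1)
print(merged)             # [[3.0, 6.0], [9.0, 16.0]]
print(matmul(merged, x))  # [[9.0], [25.0]]
```

The merged matrix gives the same output as running the base weight and the adapter separately, which is why merging removes the extra work at inference time.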

Option 3: Serve via vLLM or TGI. Both support LoRA adapters natively. This is the right call for anything beyond prototyping.

For cost-sensitive deployments on HuggingFace Spaces or small EC2 instances, I typically merge and quantize with GGUF for llama.cpp serving. A merged, Q4_K_M quantized Mistral 7B runs at ~4.5 GB RAM and ~30 tokens/second on a single CPU core, though throughput varies heavily with the CPU and its memory bandwidth, so benchmark on your target instance.


The gap between a fine-tuning tutorial and a production fine-tune is mostly dataset quality and evaluation rigor. The training code is almost the same. The judgment calls before and after — that’s where the work is.