How to Fine-Tune Qwen 2.5 for Your Use Case

Qwen 2.5 has quickly become a go‑to large language model for developers who need a blend of strong reasoning, multilingual support, and a lightweight footprint. Whether you’re building a customer‑support bot, a domain‑specific summarizer, or a code‑assistant, fine‑tuning Qwen 2.5 can unlock performance that outpaces the generic checkpoint. In this guide we’ll walk through the entire pipeline—from environment setup to data preparation, training loops, and deployment—using clear, real‑world examples you can copy‑paste and run today.

Why Fine‑Tune Qwen 2.5?

Out‑of‑the‑box, Qwen 2.5 already understands a wide range of topics, but it isn’t optimized for the nuances of your particular dataset. Fine‑tuning aligns the model’s weights with the language patterns, terminology, and answer style that matter most to your users. This often translates into higher relevance scores, lower hallucination rates, and faster convergence on downstream tasks.

Another advantage is cost efficiency. A well‑tuned small‑to‑medium model can replace a larger, more expensive API call while delivering comparable quality for a focused use case. Finally, fine‑tuning gives you full control over safety filters, bias mitigation, and licensing compliance—critical for enterprise deployments.

Setting Up the Development Environment

Before you start training, make sure you have a GPU‑enabled environment with Python 3.10+ and the latest transformers and accelerate libraries. The following script creates a reproducible Conda environment and installs the required packages.

conda create -n qwen-finetune python=3.10 -y
conda activate qwen-finetune

pip install torch==2.2.0 \
            transformers==4.41.0 \
            datasets==2.18.0 \
            accelerate==0.30.0 \
            peft==0.7.0

After installation, verify that PyTorch detects your GPU:

import torch
print("GPU available:", torch.cuda.is_available())
print("Device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

If you see GPU available: True, you’re ready to move on. Otherwise, double‑check your driver versions and CUDA toolkit compatibility.
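
If the check fails, it can help to print the CUDA and cuDNN versions PyTorch was built against and compare them with your installed driver (a quick diagnostic sketch, nothing Qwen-specific):

import torch

print("Torch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())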

Preparing Your Data

Qwen 2.5 expects data in a prompt‑completion format. Each entry should contain a system message that sets the context, a user message with the query, and an assistant message with the desired answer. The datasets library makes it easy to load CSV, JSONL, or even Parquet files and convert them to this schema.
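
For instance, a single JSONL record in this schema might look like the following (the contents are purely illustrative):

{"messages": [{"role": "system", "content": "You are a helpful support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Open Settings > Account > Reset Password and follow the emailed link."}]}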

Example: Converting a CSV of FAQs

Assume you have faqs.csv with columns question and answer. The snippet below loads the file, tokenizes it, and creates a Dataset ready for fine‑tuning.

import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

df = pd.read_csv("faqs.csv")
records = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    }
    for _, row in df.iterrows()
]

raw_dataset = Dataset.from_list(records)   # one row per conversation, exposed as a "messages" column
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def tokenize_example(example):
    # Render the conversation with Qwen 2.5's chat template (ChatML-style
    # <|im_start|>role ... <|im_end|> markers) so the prompt matches the
    # format the model was instruction-tuned on.
    prompt = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(prompt, truncation=True, max_length=1024)

tokenized_dataset = raw_dataset.map(tokenize_example, remove_columns=["messages"])

Notice that we rely on the tokenizer's built-in chat template rather than hand-rolling role tags: Qwen 2.5 was instruction-tuned on the ChatML format (<|im_start|> / <|im_end|> markers), and apply_chat_template reproduces that layout exactly.
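
If you want to see exactly what string the template produces, render one record before tokenizing (a quick sanity check using the raw_dataset built above):

sample_prompt = tokenizer.apply_chat_template(raw_dataset[0]["messages"], tokenize=False)
print(sample_prompt)   # shows the <|im_start|>system ... <|im_end|> layout Qwen 2.5 expects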

Choosing a Fine‑Tuning Strategy

There are three popular approaches for adapting Qwen 2.5: full-parameter training, LoRA (Low-Rank Adaptation), and other parameter-efficient fine-tuning (PEFT) techniques such as adapters or prompt tuning. Full-parameter training usually yields the best quality but requires massive GPU memory. LoRA strikes a balance by injecting trainable rank-decomposition matrices into selected linear layers (typically the attention projections), which shrinks the set of trainable parameters, and with it the optimizer and gradient memory, by well over an order of magnitude.

For most practical scenarios—especially when you have limited GPU resources—LoRA is the recommended default. The peft library abstracts away the boilerplate, letting you focus on data and hyper‑parameters.
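
To get a feel for the savings, compare parameter counts directly: full fine-tuning updates an entire d_out × d_in weight matrix, while LoRA only trains two low-rank factors of shape d_out × r and r × d_in. A back-of-the-envelope sketch (the 4096 × 4096 shape is illustrative, not an exact Qwen 2.5 layer):

def lora_param_counts(d_out, d_in, r):
    # Full fine-tuning updates the whole d_out x d_in matrix;
    # LoRA only trains the low-rank factors B (d_out x r) and A (r x d_in).
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, r=64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")
# full: 16,777,216  lora: 524,288  ratio: 3.1%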

Setting Up LoRA with PEFT

import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",
    torch_dtype=torch.float16,
)

lora_cfg = LoraConfig(
    r=64,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],   # attention projections; a common choice for Qwen
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

The print_trainable_parameters() call confirms that only a tiny fraction of the weights (tens of millions of parameters, versus several billion in the base model) will be updated, keeping the training footprint modest.

Practical Example 1: Fine‑Tuning for Text Classification

Suppose you need a model that can label incoming support tickets as bug, feature request, or general inquiry. You can treat classification as a generation task by prompting the model to output the label directly.

Dataset Preparation

We’ll use the public TicketIntent dataset (CSV with text and label columns). The label is appended to the assistant message.

import pandas as pd
from datasets import Dataset

df = pd.read_csv("ticket_intent.csv")

def format_example(row):
    return {
        "messages": [
            {"role": "system", "content": "You are an expert ticket triage assistant."},
            {"role": "user", "content": row["text"]},
            {"role": "assistant", "content": f"Label: {row['label']}"},
        ]
    }

dataset = Dataset.from_list([format_example(r) for _, r in df.iterrows()])
split = dataset.train_test_split(test_size=0.1, seed=42)   # keep a held-out test split
tokenized = split.map(tokenize_example, remove_columns=["messages"])

Training Loop

training_args = TrainingArguments(
    output_dir="./qwen-ticket-classifier",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    report_to="none",
)

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    # the causal-LM collator pads batches and copies input_ids into labels so the Trainer can compute a loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

After training, you can evaluate by feeding a raw ticket and parsing the assistant’s reply.

def classify_ticket(text):
    messages = [
        {"role": "system", "content": "You are an expert ticket triage assistant."},
        {"role": "user", "content": text},
    ]
    # add_generation_prompt appends the assistant header so the model replies with just the label
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=10)
    # decode only the newly generated tokens, not the echoed prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(classify_ticket("The app crashes when I click the save button."))
# Expected output: "Label: bug"

Pro tip: Use torch.compile() (available from PyTorch 2.0) to accelerate inference on the fine‑tuned model, especially when you serve many requests per second.
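
A minimal sketch of that optimization (compilation is lazy, so the first request pays a one-off warm-up cost; keep the uncompiled model around if you still need to train it):

# Wrap the fine-tuned model once at startup; generate() is still available on the wrapper.
compiled_model = torch.compile(model)
# ...then use compiled_model in place of model inside classify_ticket (or at serving time).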

Practical Example 2: Domain‑Specific Summarization

Imagine you run a legal tech startup and need concise summaries of lengthy contracts. By fine‑tuning Qwen 2.5 on a corpus of contract clauses and human‑written abstracts, you can achieve higher fidelity than generic summarizers.

Data Collection

Gather a set of .txt files where each file contains a contract section followed by a separator line (---) and the human summary. The loader below splits each file into source and summary pairs.

import glob, os

def load_contracts(folder):
    pairs = []
    for path in glob.glob(os.path.join(folder, "*.txt")):
        with open(path, "r", encoding="utf-8") as f:
            content = f.read().strip().split("---")
            if len(content) == 2:
                source, summary = map(str.strip, content)
                pairs.append({"source": source, "summary": summary})
    return pairs

raw_pairs = load_contracts("./contracts")

Prompt Engineering for Summarization

We’ll frame each example as a conversation where the user asks for a summary and the assistant replies with the target abstract.

def format_summ_example(pair):
    return {
        "messages": [
            {"role": "system", "content": "You are a concise legal summarizer."},
            {"role": "user", "content": f"Summarize the following clause:\n\n{pair['source']}"},
            {"role": "assistant", "content": pair["summary"]},
        ]
    }

summ_dataset = Dataset.from_list([format_summ_example(p) for p in raw_pairs])
summ_split = summ_dataset.train_test_split(test_size=0.1, seed=42)   # hold out a test split
tokenized_summ = summ_split.map(tokenize_example, remove_columns=["messages"])

Training Configuration

Because summarization benefits from longer context windows, we raise max_length to 2048 tokens and use a slightly lower learning rate.
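
The tokenize_example helper above hard-codes max_length=1024, so one way to honor the longer window is to re-tokenize with a small variant (a sketch; the tokenize_long name is ours):

def tokenize_long(example):
    # same chat-template rendering as before, but with a 2048-token window
    prompt = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(prompt, truncation=True, max_length=2048)

tokenized_summ = summ_split.map(tokenize_long, remove_columns=["messages"])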

training_args = TrainingArguments(
    output_dir="./qwen-legal-summ",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=4,
    learning_rate=1e-4,
    fp16=True,
    gradient_accumulation_steps=4,
    logging_steps=20,
    evaluation_strategy="steps",
    eval_steps=200,
    save_total_limit=2,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_summ["train"],
    eval_dataset=tokenized_summ["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

After fine‑tuning, generate a summary with a single call:

def summarize_clause(clause_text):
    messages = [
        {"role": "system", "content": "You are a concise legal summarizer."},
        {"role": "user", "content": f"Summarize the following clause:\n\n{clause_text}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
    # temperature only takes effect when sampling is enabled
    output = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.2)
    # return only the generated summary, not the echoed prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

example = "The lessee shall maintain the premises in good condition, ... (full clause here)"
print(summarize_clause(example))

Pro tip: If your serving traffic has fairly uniform input shapes (for example, sequences padded to a few fixed lengths), setting torch.backends.cudnn.benchmark = True lets cuDNN auto-tune its kernels for those shapes; with highly variable sequence lengths the repeated re-tuning can cancel out the gain, so benchmark before enabling it in production.

Evaluation & Metrics

Regardless of the task, you should measure performance with metrics that reflect real‑world impact. For classification, use accuracy, precision, recall, and F1. For summarization, ROUGE‑1/2/L and a human‑in‑the‑loop assessment are standard.

Here’s a quick snippet that computes classification accuracy on the test split:

from sklearn.metrics import accuracy_score   # requires: pip install scikit-learn

def predict_label(example):
    pred = classify_ticket(example["messages"][1]["content"])   # the user turn holds the raw ticket text
    return pred.split(":")[-1].strip().lower()

# Evaluate on the raw (un-tokenized) test split, which still carries the "messages" column.
preds = [predict_label(x) for x in split["test"]]
labels = [x["messages"][2]["content"].split(":")[-1].strip().lower() for x in split["test"]]
print("Accuracy:", accuracy_score(labels, preds))

For summarization, the rouge_score package (pip install rouge-score) makes evaluation painless:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def compute_rouge(reference, hypothesis):
    scores = scorer.score(reference, hypothesis)
    return {k: v.fmeasure for k, v in scores.items()}

sample = summ_split["test"][0]                     # the raw split still carries the "messages" column
ref = sample["messages"][2]["content"]             # human-written summary
source = sample["messages"][1]["content"].removeprefix("Summarize the following clause:\n\n")
hyp = summarize_clause(source)
print(compute_rouge(ref, hyp))

Deploying the Fine‑Tuned Model

Once you’re satisfied with the metrics, the next step is serving the model. A transformers pipeline is fine for quick local smoke tests, but for production you’ll likely want FastAPI combined with TorchServe or an NVIDIA Triton Inference Server.
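
Merging the LoRA Adapters

Because we trained with LoRA, the Trainer output directory contains only the small adapter weights, not a standalone model. One convenient option is to merge the adapters back into the base weights and save a self-contained checkpoint for serving. A sketch (the "-merged" directory name is just an example):

# model is still the PeftModel returned by get_peft_model()
merged_model = model.merge_and_unload()              # folds the LoRA updates into the base weights
merged_model.save_pretrained("./qwen-ticket-classifier-merged")
tokenizer.save_pretrained("./qwen-ticket-classifier-merged")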

FastAPI Wrapper

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
model_path = "./qwen-ticket-classifier-merged"   # merged checkpoint from the previous step
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

class Query(BaseModel):
    text: str

@app.post("/classify")
async def classify(query: Query):
    messages = [
        {"role": "system", "content": "You are an expert ticket triage assistant."},
        {"role": "user", "content": query.text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=10)
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"label": answer.split(":")[-1].strip()}

# Run with: uvicorn myapp:app --host 0.0.0.0 --port 8080

Remember to pin the service to a specific GPU (for example via CUDA_VISIBLE_DEVICES) and to set torch.backends.cuda.matmul.allow_tf32 = True on Ampere or newer cards for extra throughput.
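
A small startup snippet illustrating both settings (set the environment variable before any CUDA work happens):

import os
import torch

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")   # pin this process to GPU 0 (adjust as needed)
torch.backends.cuda.matmul.allow_tf32 = True         # allow TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = True               # same for cuDNN-backed ops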

Monitoring & Continuous Improvement

Model performance drifts over time as user language evolves. Implement a feedback loop that captures misclassifications or unsatisfactory summaries, stores them in a “retraining bucket,” and schedules periodic fine‑tuning runs. Tools like Weights & Biases or MLflow can track metrics, hyper‑parameters, and dataset versions.
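
A minimal sketch of such a feedback hook (the file name and record layout are ours; in production you would write to object storage or a database instead):

import json
from datetime import datetime, timezone

def log_feedback(ticket_text, predicted_label, correct_label, path="retraining_bucket.jsonl"):
    # Append one correction per line; this file becomes training data for the next run.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": ticket_text,
        "predicted": predicted_label,
        "label": correct_label,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")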

Automate the pipeline with a simple bash script or CI/CD job:

#!/usr/bin/env bash
# 1. Pull latest data
git pull origin main

# 2. Run preprocessing
python preprocess.py

# 3. Fine-tune (reuse the training code above, wrapped in a script of your own)
python finetune.py