Hugging Face Transformers Tutorial

Welcome to the world of Hugging Face Transformers! Whether you're a seasoned data scientist or a curious developer, this tutorial will guide you through the essentials of loading, fine‑tuning, and deploying state‑of‑the‑art language models. We'll keep the explanations light, the code runnable, and sprinkle in real‑world examples you can adapt right away.

What Are Transformers and Why Hugging Face?

Transformers are a family of neural architectures that excel at processing sequential data, especially natural language. Their self‑attention mechanism lets the model weigh the importance of each token relative to all others, resulting in powerful contextual representations.
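
To make that concrete, here is a minimal sketch of scaled dot-product self-attention in plain PyTorch. The toy tensors and single head are illustrative only; real Transformer layers add learned projections, multiple heads, and masking.

import torch
import torch.nn.functional as F

# Toy batch: 1 sequence, 4 tokens, hidden size 8
hidden_states = torch.randn(1, 4, 8)

# In a real layer, queries, keys, and values come from learned linear projections
q = k = v = hidden_states

# Attention weights: how strongly each token attends to every other token
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # shape (1, 4, 4)
weights = F.softmax(scores, dim=-1)

# Contextual representations: weighted sums of the value vectors
context = weights @ v                                    # shape (1, 4, 8)
print(weights[0])  # each row sums to 1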

Hugging Face provides an open‑source library—transformers—that bundles thousands of pretrained models, a unified API, and tools for tokenization, training, and inference. This ecosystem reduces the friction of moving from research papers to production code.

Key Concepts to Keep in Mind

  • Tokenizer: Converts raw text into model‑compatible token IDs.
  • Model Class: Each architecture (BERT, GPT‑2, T5, etc.) has a corresponding Python class.
  • Pipeline: High‑level abstraction for common tasks like sentiment analysis or translation.

Pro tip: Always inspect the tokenizer’s special tokens (cls_token, sep_token, pad_token) before feeding data into a model, as in the snippet below. Mismatches cause subtle bugs that are hard to debug later.
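
For example, a quick inspection with distilbert-base-uncased (an illustrative checkpoint choice) looks like this:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tok.cls_token, tok.sep_token, tok.pad_token)  # [CLS] [SEP] [PAD]
print(tok.special_tokens_map)                       # full mapping of special tokens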

Getting Started: Installation and First Inference

The first step is to install the library and its dependencies. We’ll also pull a small model to keep the runtime light.

# Install the core library and PyTorch (or TensorFlow)
pip install transformers torch sentencepiece

Now, let’s perform sentiment analysis using the pipeline API. This one‑liner hides tokenization, model loading, and post‑processing.

from transformers import pipeline

# Load a pretrained sentiment-analysis pipeline
sentiment = pipeline("sentiment-analysis")

# Run inference on a sample sentence
result = sentiment("Hugging Face makes NLP so much easier!")
print(result)

The output will look like [{'label': 'POSITIVE', 'score': 0.9998}], confirming that the model recognized the positive sentiment. This simple example demonstrates how quickly you can prototype with Transformers.

Deep Dive: Tokenizers and Model Inputs

Behind the scenes, the pipeline uses a tokenizer to split text into subword units. Let’s explore this step manually to understand what the model actually receives.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sentence = "Transformers are revolutionizing AI."
inputs = tokenizer(sentence, return_tensors="pt")
print(inputs)

The printed dictionary contains input_ids, attention_mask, and sometimes token_type_ids. These tensors are what the model ingests during forward passes.
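
To close the loop, you can run the forward pass yourself and map the logits back to a label via the model's id2label mapping. This short sketch reuses the tokenizer, model, and inputs defined above:

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2) for this binary classifier

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # e.g. POSITIVE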

Batch Processing

Processing one sentence at a time is inefficient. Tokenizers support batch encoding, automatically padding sequences to the same length.

batch = [
    "I love using Hugging Face.",
    "Sometimes the API can be confusing.",
    "Fine‑tuning yields amazing results!"
]

batch_inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(batch_inputs["input_ids"].shape)  # (3, max_seq_len)

Now you can feed the whole batch to the model in a single forward pass, leveraging GPU parallelism for speed.
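
For instance, reusing model and batch_inputs from above, a single batched forward pass might look like this:

with torch.no_grad():
    logits = model(**batch_inputs).logits  # shape (3, 2)

predictions = logits.argmax(dim=-1)
print([model.config.id2label[i.item()] for i in predictions])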

Pro tip: Use padding='longest' (default) for dynamic batches, or padding='max_length' with a fixed max_length when deploying to production to guarantee constant tensor shapes.

Fine‑Tuning a Model on Custom Data

Pretrained models are generic; to excel on a specific task, you often need to fine‑tune them on domain‑specific data. We'll walk through a sentiment‑analysis fine‑tune using the 🤗 Datasets library and the Trainer API.

Preparing the Dataset

Assume we have a CSV with two columns: text and label (0 = negative, 1 = positive). The Datasets library makes loading and preprocessing painless.

from datasets import load_dataset

# Load a local CSV; replace with your own path
data_files = {"train": "train.csv", "validation": "val.csv"}
raw_datasets = load_dataset("csv", data_files=data_files)

# Quick sanity check
print(raw_datasets["train"][0])

Next, tokenize the dataset. The map function applies the tokenizer to each example efficiently.

def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_batch, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Setting Up the Trainer

The Trainer abstracts the training loop, handling gradient accumulation, evaluation, and checkpointing.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()

After training, you can evaluate the model on the validation set or export it for inference.
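
Out of the box, trainer.evaluate() reports the evaluation loss; to also track accuracy, you can pass a compute_metrics function when constructing the Trainer. The function below is a minimal sketch, not the only way to compute metrics:

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

# Pass compute_metrics=compute_metrics to the Trainer above, then:
metrics = trainer.evaluate()
print(metrics)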

Saving and Reloading the Fine‑Tuned Model

# Save the model and tokenizer
trainer.save_model("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")

# Reload later
from transformers import pipeline
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="./my-sentiment-model",
    tokenizer="./my-sentiment-model"
)

print(sentiment_pipe("Your custom data is now understood!"))

Pro tip: When fine‑tuning with limited GPU memory, enable gradient_checkpointing=True in TrainingArguments to reduce memory consumption; it trades some training speed for that saving but does not affect model accuracy.

Real‑World Use Cases

Transformers aren't limited to sentiment analysis. Below are three common scenarios where Hugging Face shines.

1. Text Summarization for News Articles

Summarization condenses long documents into concise bullet points. The t5-small model offers a good trade‑off between speed and quality.

from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
article = """
Artificial intelligence has seen rapid advances in the past decade...
(Imagine a 1,000‑word news article here)
"""

summary = summarizer(article, max_length=80, min_length=30, do_sample=False)
print(summary[0]["summary_text"])

This snippet can be wrapped in a Flask endpoint to provide on‑demand summarization for a news aggregator.
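
A minimal sketch of such an endpoint is shown below; the /summarize route and JSON payload shape are illustrative choices, not part of any library:

from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
summarizer = pipeline("summarization", model="t5-small")

@app.route("/summarize", methods=["POST"])
def summarize():
    text = request.json["text"]
    summary = summarizer(text, max_length=80, min_length=30, do_sample=False)
    return jsonify({"summary": summary[0]["summary_text"]})

# Start with `flask run` (or app.run()) and POST JSON like {"text": "..."}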

2. Zero‑Shot Classification for Content Moderation

Zero‑shot models let you classify text into arbitrary labels without explicit training. This is perfect for dynamic moderation policies.

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

candidate_labels = ["spam", "harassment", "offensive", "safe"]
text = "Buy cheap sunglasses now! Click the link below."

result = classifier(text, candidate_labels)
print(result)

The output includes a confidence score for each label, enabling you to set thresholds based on your platform’s tolerance.
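
For example, you might flag a message whenever any non-"safe" label crosses a threshold; the 0.7 cutoff below is an arbitrary illustration, not a recommended value:

THRESHOLD = 0.7  # tune per moderation policy

flagged = [
    (label, score)
    for label, score in zip(result["labels"], result["scores"])
    if label != "safe" and score >= THRESHOLD
]

if flagged:
    print("Flagged:", flagged)
else:
    print("Clean")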

3. Code Generation with Codex‑Style Models

Open‑source code models like Salesforce/codegen-350M-mono can autocomplete snippets or translate pseudocode to Python.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

prompt = "def fibonacci(n):\n    \"\"\"Return the nth Fibonacci number\"\"\"\n    "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate up to 50 new tokens
generated = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

Integrate this into an IDE plugin to suggest code completions in real time.

Pro tip: When using large generative models, disable sampling (or use a low temperature) when you want stable, reproducible output, and raise the temperature for creative variations. Always post‑process with a linter to ensure syntactic correctness.

Optimizing Inference for Production

Deploying a Transformer model to production demands low latency and modest memory footprint. Here are three strategies you can apply.

Model Quantization

Quantization reduces the precision of weights from 32‑bit floating point to 8‑bit integers. Dynamic quantization applies this to the linear layers, cutting their memory footprint by roughly 4× with minimal accuracy loss.

import os
import torch
from transformers import AutoModelForSequenceClassification
from torch.quantization import quantize_dynamic

model_fp32 = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Apply dynamic quantization to the linear layers
model_int8 = quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# Save both versions to disk and compare their sizes
torch.save(model_fp32.state_dict(), "fp32.pth")
torch.save(model_int8.state_dict(), "int8.pth")
print("FP32 size:", os.path.getsize("fp32.pth") / 1e6, "MB")
print("INT8 size:", os.path.getsize("int8.pth") / 1e6, "MB")

After quantization, you can reload the model and serve it with the same inference code.
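
As a quick sanity check (assuming the same checkpoint's tokenizer), the quantized model accepts exactly the same inputs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("Quantization barely changed the predictions.", return_tensors="pt")

with torch.no_grad():
    logits = model_int8(**inputs).logits
print(model_int8.config.id2label[logits.argmax(dim=-1).item()])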

ONNX Export for Faster Runtime

Exporting to ONNX enables the use of highly optimized runtimes such as ONNX Runtime or TensorRT.

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

dummy_input = torch.randint(0, 1000, (1, 128))
torch.onnx.export(
    model,
    (dummy_input,),
    "distilbert.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch_size", 1: "seq_len"},
                  "logits": {0: "batch_size"}},
    opset_version=12
)

Load the ONNX model in a Flask or FastAPI service with onnxruntime.InferenceSession to cut inference latency substantially compared with eager PyTorch.
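
Here is a minimal standalone sketch of that inference path, assuming onnxruntime is installed and distilbert.onnx is the file exported above:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("distilbert.onnx")

encoded = tokenizer("ONNX Runtime keeps latency low.", return_tensors="np")
# The export above only declared input_ids, so that is the only input we feed
logits = session.run(["logits"], {"input_ids": encoded["input_ids"].astype(np.int64)})[0]
print(logits.argmax(axis=-1))  # 0 = NEGATIVE, 1 = POSITIVE for this checkpoint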

Batching and Asynchronous Requests

When serving many concurrent users, aggregate incoming texts into a batch, run a single forward pass, and return results asynchronously.

import asyncio
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)

def _classify(texts):
    # Blocking tokenization + forward pass for one aggregated batch
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

async def classify_batch(texts):
    # Run the blocking work in a thread so the event loop stays responsive
    return await asyncio.to_thread(_classify, texts)

# Example usage
texts = ["I love this!", "Terrible experience.", "Meh, it’s okay."]
probs = asyncio.run(classify_batch(texts))
print(probs)

This pattern scales well behind a load balancer and keeps GPU utilization high.

Advanced Topics: Custom Heads and Multi‑Task Learning

Sometimes a single pretrained backbone isn’t enough; you may need to attach a custom classification head or share the encoder across multiple tasks.

Adding a Regression Head

Suppose you want to predict a continuous score (e.g., product rating). You can replace the classification head with a regression layer.

from transformers import AutoModel, AutoConfig
import torch.nn as nn

class BertForRegression(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name, config=config)
        self.regressor = nn.Linear(config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # Use the [CLS] token representation
        cls_output = outputs.last_hidden_state[:, 0, :]
        rating = self.regressor(cls_output)
        return rating

model = BertForRegression("bert-base-uncased")

Train this model with a mean‑squared‑error loss for regression tasks.
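
A single training step might look like the sketch below; the dummy tensors stand in for a real DataLoader batch, and the hyperparameters are placeholders:

import torch

criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Dummy batch: 4 tokenized sequences of length 16 and their target ratings
input_ids = torch.randint(0, 30522, (4, 16))  # 30522 = BERT vocabulary size
attention_mask = torch.ones_like(input_ids)
targets = torch.tensor([[4.5], [2.0], [3.5], [5.0]])

predictions = model(input_ids, attention_mask=attention_mask)  # shape (4, 1)
loss = criterion(predictions, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())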

Multi‑Task Learning with Shared Encoder

Imagine you need both sentiment and topic classification from the same text. You can share the encoder and attach two heads, training them jointly.

class MultiTaskModel(nn.Module):
    def __init__(self, model_name, num_sentiment_labels, num_topic_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, num_sentiment_labels)
        self.topic_head = nn.Linear(hidden, num_topic_labels)

    def forward(self, input_ids, attention_mask=None):
        enc_out = self.encoder(input_ids, attention_mask=attention_mask)
        cls = enc_out.last_hidden_state[:, 0, :]
        sentiment_logits = self.sentiment_head(cls)
        topic_logits = self.topic_head(cls)
        return sentiment_logits, topic_logits

model = MultiTaskModel(
    "distilbert-base-uncased", num_sentiment_labels=2, num_topic_labels=5
)

During training, compute separate losses for each head and combine them (e.g., weighted sum) to backpropagate through the shared encoder.
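
In code, that combination might look like the sketch below; the equal 0.5 weights and dummy labels are illustrative starting points:

import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

# Dummy batch of 4 sequences with one label per task
input_ids = torch.randint(0, 30522, (4, 16))
attention_mask = torch.ones_like(input_ids)
sentiment_labels = torch.tensor([1, 0, 1, 1])
topic_labels = torch.tensor([3, 0, 2, 4])

sentiment_logits, topic_logits = model(input_ids, attention_mask=attention_mask)
loss = 0.5 * ce(sentiment_logits, sentiment_labels) + 0.5 * ce(topic_logits, topic_labels)
loss.backward()  # gradients flow into both heads and the shared encoder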

Pro tip: When balancing losses, start with equal weights, then adjust based on validation performance. Over‑emphasizing one task can degrade the other.

Testing and Monitoring Your Transformers Service

Robust production pipelines need automated tests and monitoring. Here’s a quick checklist.

  • Unit Tests: Verify tokenization, model output shapes, and edge‑case handling (empty strings, long inputs).
  • Integration Tests: Spin up the API container and send real requests; assert latency < 200 ms.
  • Health Checks: Expose an endpoint that returns model version and GPU memory usage.

Example health endpoint using FastAPI:

from fastapi import FastAPI
from transformers import AutoModelForSequenceClassification
import torch

app = FastAPI()
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

@app.get("/health")
def health():
    return {
        "model": "distilbert-base-uncased",
        "device": str(device),
        "gpu_memory_allocated_mb": torch.cuda.memory_allocated() / 1e6
        if torch.cuda.is_available() else 0.0,
    }