Hugging Face Transformers Tutorial
Welcome to the world of Hugging Face Transformers! Whether you're a seasoned data scientist or a curious developer, this tutorial will guide you through the essentials of loading, fine‑tuning, and deploying state‑of‑the‑art language models. We'll keep the explanations light, the code runnable, and sprinkle in real‑world examples you can adapt right away.
What Are Transformers and Why Hugging Face?
Transformers are a family of neural architectures that excel at processing sequential data, especially natural language. Their self‑attention mechanism lets the model weigh the importance of each token relative to all others, resulting in powerful contextual representations.
Hugging Face provides an open‑source library—transformers—that bundles thousands of pretrained models, a unified API, and tools for tokenization, training, and inference. This ecosystem reduces the friction of moving from research papers to production code.
Key Concepts to Keep in Mind
- Tokenizer: Converts raw text into model‑compatible token IDs.
- Model Class: Each architecture (BERT, GPT‑2, T5, etc.) has a corresponding Python class.
- Pipeline: High‑level abstraction for common tasks like sentiment analysis or translation.
Pro tip: Always inspect the tokenizer’s special tokens (cls_token, sep_token, pad_token) before feeding data into a model. Mismatches cause subtle bugs that are hard to debug later.
Getting Started: Installation and First Inference
The first step is to install the library and its dependencies. We’ll also pull a small model to keep the runtime light.
# Install the core library and PyTorch (or TensorFlow)
pip install transformers torch sentencepiece
Now, let’s perform sentiment analysis using the pipeline API. This one‑liner hides tokenization, model loading, and post‑processing.
from transformers import pipeline
# Load a pretrained sentiment-analysis pipeline
sentiment = pipeline("sentiment-analysis")
# Run inference on a sample sentence
result = sentiment("Hugging Face makes NLP so much easier!")
print(result)
The output will look like [{'label': 'POSITIVE', 'score': 0.9998}], confirming that the model recognized the positive sentiment. This simple example demonstrates how quickly you can prototype with Transformers.
Deep Dive: Tokenizers and Model Inputs
Behind the scenes, the pipeline uses a tokenizer to split text into subword units. Let’s explore this step manually to understand what the model actually receives.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
sentence = "Transformers are revolutionizing AI."
inputs = tokenizer(sentence, return_tensors="pt")
print(inputs)
The printed dictionary contains input_ids, attention_mask, and sometimes token_type_ids. These tensors are what the model ingests during forward passes.
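To see what those IDs actually represent, and what comes back out, here is a short continuation of the snippet above; it reuses the tokenizer, model, and inputs already defined, so nothing new is assumed.
# Map the IDs back to subword tokens to inspect the segmentation
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# Forward pass: raw logits over the two sentiment classes, then probabilities
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))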
Batch Processing
Processing one sentence at a time is inefficient. Tokenizers support batch encoding, automatically padding sequences to the same length.
batch = [
"I love using Hugging Face.",
"Sometimes the API can be confusing.",
"Fine‑tuning yields amazing results!"
]
batch_inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(batch_inputs["input_ids"].shape) # (3, max_seq_len)
Now you can feed the whole batch to the model in a single forward pass, leveraging GPU parallelism for speed.
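As a sketch of that forward pass, reusing the model and batch_inputs from above (looking labels up via model.config.id2label avoids hard-coding the class order):
with torch.no_grad():
    logits = model(**batch_inputs).logits  # shape: (3, 2)
probs = torch.softmax(logits, dim=-1)
for text, p in zip(batch, probs):
    print(text, "->", model.config.id2label[p.argmax().item()], round(p.max().item(), 4))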
Pro tip: Use padding='longest' (equivalent to padding=True) for dynamic batches, or padding='max_length' with a fixed max_length when deploying to production to guarantee constant tensor shapes.
Fine‑Tuning a Model on Custom Data
Pretrained models are generic; to excel on a specific task, you often need to fine‑tune them on domain‑specific data. We'll walk through a sentiment‑analysis fine‑tune using the 🤗 Datasets library and the Trainer API.
Preparing the Dataset
Assume we have a CSV with two columns: text and label (0 = negative, 1 = positive). The Datasets library makes loading and preprocessing painless.
from datasets import load_dataset
# Load a local CSV; replace with your own path
data_files = {"train": "train.csv", "validation": "val.csv"}
raw_datasets = load_dataset("csv", data_files=data_files)
# Quick sanity check
print(raw_datasets["train"][0])
Next, tokenize the dataset. The map function applies the tokenizer to each example efficiently.
def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
tokenized_datasets = raw_datasets.map(tokenize_batch, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])
Setting Up the Trainer
The Trainer abstracts the training loop, handling gradient accumulation, evaluation, and checkpointing.
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2
)
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=10,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
)
trainer.train()
After training, you can evaluate the model on the validation set or export it for inference.
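For example, assuming the trainer object defined above: evaluate() only reports the loss unless a compute_metrics function such as the hypothetical one below was passed to the Trainer.
import numpy as np
def compute_metrics(eval_pred):
    # The Trainer hands over (predictions, label_ids) for the whole eval set
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}
# Pass compute_metrics=compute_metrics when constructing the Trainer above to get accuracy, then:
metrics = trainer.evaluate()
print(metrics)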
Saving and Reloading the Fine‑Tuned Model
# Save the model and tokenizer
trainer.save_model("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")
# Reload later
from transformers import pipeline
sentiment_pipe = pipeline(
"sentiment-analysis",
model="./my-sentiment-model",
tokenizer="./my-sentiment-model"
)
print(sentiment_pipe("Your custom data is now understood!"))
Pro tip: If GPU memory is tight during fine‑tuning, enable gradient_checkpointing=True in TrainingArguments to cut activation memory; model quality is unaffected, though each training step runs somewhat slower.
Real‑World Use Cases
Transformers aren't limited to sentiment analysis. Below are three common scenarios where Hugging Face shines.
1. Text Summarization for News Articles
Summarization condenses long documents into concise bullet points. The t5-small model offers a good trade‑off between speed and quality.
from transformers import pipeline
summarizer = pipeline("summarization", model="t5-small")
article = """
Artificial intelligence has seen rapid advances in the past decade...
(Imagine a 1,000‑word news article here)
"""
summary = summarizer(article, max_length=80, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
This snippet can be wrapped in a Flask endpoint to provide on‑demand summarization for a news aggregator.
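A rough sketch of that idea is shown below; the /summarize route and the request shape are illustrative choices, not a fixed API.
from flask import Flask, request, jsonify
from transformers import pipeline
app = Flask(__name__)
summarizer = pipeline("summarization", model="t5-small")  # load the model once at startup
@app.route("/summarize", methods=["POST"])
def summarize():
    text = request.get_json()["text"]
    result = summarizer(text, max_length=80, min_length=30, do_sample=False)
    return jsonify({"summary": result[0]["summary_text"]})
# Start with `flask run` and POST {"text": "..."} to /summarize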
2. Zero‑Shot Classification for Content Moderation
Zero‑shot models let you classify text into arbitrary labels without explicit training. This is perfect for dynamic moderation policies.
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli")
candidate_labels = ["spam", "harassment", "offensive", "safe"]
text = "Buy cheap sunglasses now! Click the link below."
result = classifier(text, candidate_labels)
print(result)
The output includes a confidence score for each label, enabling you to set thresholds based on your platform’s tolerance.
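One way to act on those scores; the 0.7 threshold is an arbitrary example and should be tuned for your platform.
THRESHOLD = 0.7  # illustrative cut-off
# labels and scores are returned sorted from most to least likely
flagged = [label for label, score in zip(result["labels"], result["scores"])
           if label != "safe" and score >= THRESHOLD]
print("flagged as:", flagged if flagged else "nothing")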
3. Code Generation with Codex‑Style Models
Open‑source code models like Salesforce/codegen-350M-mono can autocomplete snippets or translate pseudocode to Python.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
prompt = "def fibonacci(n):\n \"\"\"Return the nth Fibonacci number\"\"\"\n "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Sample up to 50 new tokens; do_sample=True is required for temperature to take effect
generated = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
Integrate this into an IDE plugin to suggest code completions in real time.
Pro tip: When using large generative models, use greedy decoding (do_sample=False) for reproducible output, or enable sampling with a higher temperature for creative variations. Always post‑process with a linter to ensure syntactic correctness.
Optimizing Inference for Production
Deploying a Transformer model to production demands low latency and modest memory footprint. Here are three strategies you can apply.
Model Quantization
Quantization reduces the precision of weights from 32‑bit floating point to 8‑bit integers, shrinking the quantized layers’ memory footprint by roughly 4× with minimal accuracy loss. Dynamic quantization, shown below, applies this to the Linear layers.
import torch
from transformers import AutoModelForSequenceClassification
from torch.quantization import quantize_dynamic
model_fp32 = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
# Apply dynamic quantization
model_int8 = quantize_dynamic(
model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)
# Save both state dicts and compare file sizes to verify the reduction
import os
torch.save(model_fp32.state_dict(), "fp32.pth")
torch.save(model_int8.state_dict(), "int8.pth")
print("FP32 size:", os.path.getsize("fp32.pth") / 1e6, "MB")
print("INT8 size:", os.path.getsize("int8.pth") / 1e6, "MB")
After quantization, you can serve the model with the same inference code; note that dynamically quantized models run on the CPU backend.
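For instance, a quick check that the quantized model still classifies text; the tokenizer is loaded here again so the snippet stands on its own.
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("Quantization barely changed this prediction.", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model_int8(**inputs).logits, dim=-1)
print(probs)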
ONNX Export for Faster Runtime
Exporting to ONNX enables the use of highly optimized runtimes such as ONNX Runtime or TensorRT.
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()
dummy_input = torch.randint(0, 1000, (1, 128))
torch.onnx.export(
model,
(dummy_input,),
"distilbert.onnx",
input_names=["input_ids"],
output_names=["logits"],
dynamic_axes={"input_ids": {0: "batch_size", 1: "seq_len"},
"logits": {0: "batch_size"}},
opset_version=12
)
Load the ONNX model in a Flask or FastAPI service with onnxruntime.InferenceSession to reduce per‑request latency compared with eager PyTorch.
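A minimal serving sketch under those assumptions: onnxruntime is installed separately, and since the export above declared only input_ids, that is the only tensor fed in.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
session = ort.InferenceSession("distilbert.onnx", providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
encoded = tokenizer("ONNX Runtime keeps latency low.", return_tensors="np")
logits = session.run(["logits"], {"input_ids": encoded["input_ids"].astype(np.int64)})[0]
print(logits)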
Batching and Asynchronous Requests
When serving many concurrent users, aggregate incoming texts into a batch, run a single forward pass, and return results asynchronously.
import asyncio
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
model.eval()
def _classify(texts):
    # Blocking tokenization + single forward pass for one aggregated batch
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()
async def classify_batch(texts):
    # Offload the blocking model call to a worker thread so the event loop stays responsive
    return await asyncio.to_thread(_classify, texts)
# Example usage
texts = ["I love this!", "Terrible experience.", "Meh, it’s okay."]
probs = asyncio.run(classify_batch(texts))
print(probs)
This pattern scales well behind a load balancer and keeps GPU utilization high.
Advanced Topics: Custom Heads and Multi‑Task Learning
Sometimes a single pretrained backbone isn’t enough; you may need to attach a custom classification head or share the encoder across multiple tasks.
Adding a Regression Head
Suppose you want to predict a continuous score (e.g., product rating). You can replace the classification head with a regression layer.
from transformers import AutoModel, AutoConfig
import torch.nn as nn
class BertForRegression(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name, config=config)
        self.regressor = nn.Linear(config.hidden_size, 1)
    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # Use the [CLS] token representation
        cls_output = outputs.last_hidden_state[:, 0, :]
        rating = self.regressor(cls_output)
        return rating
model = BertForRegression("bert-base-uncased")
Train this model with a mean‑squared‑error loss for regression tasks.
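A single training step might look like the sketch below; the texts and target ratings are invented purely for illustration, and model is the BertForRegression instance defined above.
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()
texts = ["Great product, works as advertised.", "Broke after two days."]
targets = torch.tensor([[4.5], [1.0]])  # made-up ratings, shape (batch, 1)
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
predictions = model(**inputs)           # shape (batch, 1)
loss = loss_fn(predictions, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()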
Multi‑Task Learning with Shared Encoder
Imagine you need both sentiment and topic classification from the same text. You can share the encoder and attach two heads, training them jointly.
class MultiTaskModel(nn.Module):
    def __init__(self, model_name, num_sentiment_labels, num_topic_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, num_sentiment_labels)
        self.topic_head = nn.Linear(hidden, num_topic_labels)
    def forward(self, input_ids, attention_mask=None):
        enc_out = self.encoder(input_ids, attention_mask=attention_mask)
        cls = enc_out.last_hidden_state[:, 0, :]
        sentiment_logits = self.sentiment_head(cls)
        topic_logits = self.topic_head(cls)
        return sentiment_logits, topic_logits
model = MultiTaskModel(
"distilbert-base-uncased", num_sentiment_labels=2, num_topic_labels=5
)
During training, compute separate losses for each head and combine them (e.g., weighted sum) to backpropagate through the shared encoder.
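A sketch of one such step, reusing the MultiTaskModel instance above; the example labels and the equal 0.5 weights are placeholders.
import torch
import torch.nn as nn
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ce_loss = nn.CrossEntropyLoss()
texts = ["The battery life is fantastic.", "Shipping took forever."]
sentiment_labels = torch.tensor([1, 0])  # e.g. positive / negative
topic_labels = torch.tensor([3, 2])      # made-up topic IDs out of 5 classes
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
sentiment_logits, topic_logits = model(**inputs)
# Weighted sum of the per-task losses; adjust the weights on validation data
loss = 0.5 * ce_loss(sentiment_logits, sentiment_labels) + 0.5 * ce_loss(topic_logits, topic_labels)
loss.backward()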
Pro tip: When balancing losses, start with equal weights, then adjust based on validation performance. Over‑emphasizing one task can degrade the other.
Testing and Monitoring Your Transformers Service
Robust production pipelines need automated tests and monitoring. Here’s a quick checklist.
- Unit Tests: Verify tokenization, model output shapes, and edge‑case handling (empty strings, long inputs); a small pytest sketch follows this list.
- Integration Tests: Spin up the API container and send real requests; assert latency < 200 ms.
- Health Checks: Expose an endpoint that returns model version and GPU memory usage.
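For the unit-test item above, a minimal pytest sketch; the expected (2, 2) shape follows from the two-label SST-2 checkpoint, so adapt the model name and assertions to your own model.
import pytest
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
@pytest.fixture(scope="module")
def components():
    return (AutoTokenizer.from_pretrained(MODEL_NAME),
            AutoModelForSequenceClassification.from_pretrained(MODEL_NAME))
def test_output_shape_and_empty_string(components):
    tokenizer, model = components
    inputs = tokenizer(["hello world", ""], padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    assert logits.shape == (2, 2)
def test_long_input_is_truncated(components):
    tokenizer, _ = components
    inputs = tokenizer("word " * 10_000, truncation=True, return_tensors="pt")
    assert inputs["input_ids"].shape[1] <= tokenizer.model_max_length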
Example health endpoint using FastAPI:
from fastapi import FastAPI
from transformers import AutoModelForSequenceClassification
import torch
app = FastAPI()
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
@app.get("/health")
def health():
    # Report model identity, device, and GPU memory usage for monitoring
    return {
        "model": "distilbert-base-uncased",
        "device": str(device),
        "gpu_memory_allocated_mb": torch.cuda.memory_allocated() / 1e6 if torch.cuda.is_available() else 0,
    }