How to Build a RAG Chatbot with Llama 3.2

Welcome to the world of Retrieval‑Augmented Generation (RAG) with Llama 3.2! In this guide we’ll walk through every step required to turn the powerful Llama 3.2 model into a chat‑assistant that can pull facts from your own documents, answer questions in real time, and stay up‑to‑date without costly re‑training.

What is RAG and Why Llama 3.2?

RAG combines a dense vector store with a generative LLM. When a user asks a question, the system first retrieves the most relevant chunks from a knowledge base, then feeds those chunks into the LLM as context. The result is a response that is both fluent and grounded in source material.

Llama 3.2, Meta’s latest open-weight release, offers a sweet spot of strong quality, low latency, and a permissive community license. Its lightweight text models (1B and 3B parameters) are designed for efficient inference on commodity GPUs, making them ideal for on-premise RAG deployments where data privacy matters.

Prerequisites and Environment Setup

Before we dive into code, make sure you have the following:

  • Python 3.10 or newer
  • CUDA-capable GPU with at least 12 GB VRAM (an RTX 3060 comfortably runs the 3B Instruct model; larger Llama variants benefit from 24 GB or more)
  • Git, pip, and a virtual environment tool (venv or conda)

We’ll use PyTorch for model loading, sentence-transformers for embedding documents, and FAISS as the vector store. Create a virtual environment and install everything with the following commands:

python -m venv rag-env
source rag-env/bin/activate  # Windows: rag-env\Scripts\activate
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate sentence-transformers faiss-cpu langchain langchain-community pypdf "unstructured[md]"

The extra packages round out the pipeline: accelerate backs device_map="auto", langchain-community provides the document loaders, and pypdf and unstructured handle PDF and Markdown parsing. If you want GPU-accelerated similarity search, you can swap faiss-cpu for faiss-gpu (most easily installed via conda).

Downloading and Preparing Llama 3.2

Llama 3.2 is distributed through the Hugging Face Hub under Meta’s meta-llama organization. You’ll need a free Hugging Face account, and you must request access and accept the Llama 3.2 community license before the weights will download.
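Once access is granted, authenticate so that transformers can pull the gated weights. A minimal sketch using the huggingface_hub helper (the token string is a placeholder for your own access token; running huggingface-cli login in a terminal works just as well):

from huggingface_hub import login

# Paste a personal access token created at https://huggingface.co/settings/tokens
login(token="hf_your_token_here")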

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # automatically splits across GPUs
    torch_dtype="auto"
)

Once loaded, the model can be called with a simple generate request. In the RAG pipeline we’ll prepend retrieved context to the user prompt.
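Before wiring up retrieval, it’s worth a quick sanity check that generation works at all. A minimal sketch (the prompt text is just an example):

prompt = "Explain retrieval-augmented generation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))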

Step 1 – Collecting and Chunking Your Documents

The quality of a RAG chatbot hinges on how well you break raw files into searchable pieces. We’ll use LangChain loaders to ingest PDFs, Markdown, and plain text, then split them into chunks of roughly 500 characters with a 100-character overlap (RecursiveCharacterTextSplitter measures length in characters by default, not tokens).

from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredMarkdownLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_documents(paths):
    docs = []
    for path in paths:
        if path.endswith(".pdf"):
            loader = PyPDFLoader(path)
        elif path.endswith(".md"):
            loader = UnstructuredMarkdownLoader(path)
        else:
            loader = TextLoader(path)
        docs.extend(loader.load())
    return docs

raw_docs = load_documents(["./data/faq.pdf", "./data/policy.md", "./data/notes.txt"])

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", " "]
)
chunks = splitter.split_documents(raw_docs)
print(f"Created {len(chunks)} chunks")

Each Document object now contains page_content and optional metadata (source file, page number, etc.). Preserving metadata is crucial for traceability when you later cite sources.
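A quick way to confirm that metadata survived chunking (a small check using the chunks list from above; the exact keys depend on the loader):

sample = chunks[0]
print(sample.metadata)            # e.g. {'source': './data/faq.pdf', 'page': 0}
print(sample.page_content[:200])  # first 200 characters of the chunk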

Step 2 – Embedding the Chunks

We’ll generate dense vectors with a lightweight sentence‑transformer model. The all-MiniLM-L6-v2 encoder strikes a good balance between speed and accuracy for most business documents.

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
def embed_chunks(docs):
    texts = [doc.page_content for doc in docs]
    embeddings = embedder.encode(texts, batch_size=32, show_progress_bar=True)
    return np.array(embeddings).astype("float32")

embeddings = embed_chunks(chunks)

Note that we cast the embeddings to float32 – FAISS indexes only accept this dtype.

Step 3 – Building the Vector Store with FAISS

FAISS provides an in‑memory index that supports fast approximate nearest‑neighbor search. For production you might switch to faiss‑gpu or a persisted store like Milvus.

import faiss

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)   # exact L2 search; replace with IVF for scalability
index.add(embeddings)

# Helper to map FAISS results back to Document objects
def retrieve(query, k=5):
    query_vec = embedder.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_vec, k)
    results = [chunks[i] for i in indices[0]]
    return results, distances[0]

Now you have a function that, given a user query, returns the top‑k most relevant document chunks.
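Because both the chunks and the index live in memory, it is worth persisting them so you don’t re-embed everything on every restart. A minimal sketch (file names are arbitrary; faiss.write_index and faiss.read_index are the standard FAISS serialization calls):

import pickle

faiss.write_index(index, "kb.index")          # save the vector index
with open("kb_chunks.pkl", "wb") as f:        # save the chunk list so ids map back to Documents
    pickle.dump(chunks, f)
# Later: index = faiss.read_index("kb.index")

And a quick usage check of the retriever itself:

results, distances = retrieve("How do I reset my password?", k=3)
for doc, dist in zip(results, distances):
    print(round(float(dist), 3), doc.metadata.get("source"))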

Step 4 – Prompt Engineering for Llama 3.2

Llama 3.2 Instruct is instruction-tuned, so it responds best to a clearly structured prompt. For readability we’ll use simple [SYSTEM]/[CONTEXT]/[USER] markers that wrap the retrieved context in a system prompt telling the model to cite sources and stay concise.

SYSTEM_PROMPT = """You are a helpful assistant. Use the provided CONTEXT to answer the USER's QUESTION.
If you use information from the CONTEXT, cite the source in the format [source].
If you don't know the answer, say so honestly."""

The final prompt concatenates system, context, and user query:

def build_prompt(context_chunks, user_question):
    context_text = "\n\n".join([c.page_content for c in context_chunks])
    prompt = f"""[SYSTEM]\n{SYSTEM_PROMPT}\n\n[CONTEXT]\n{context_text}\n\n[USER]\n{user_question}"""
    return prompt
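If you prefer the model’s native chat format over the hand-rolled markers above, the tokenizer can build the prompt for you. A sketch of that alternative (build_chat_inputs is a hypothetical helper; pass its output straight to model.generate):

def build_chat_inputs(context_chunks, user_question):
    context_text = "\n\n".join(c.page_content for c in context_chunks)
    messages = [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nCONTEXT:\n{context_text}"},
        {"role": "user", "content": user_question},
    ]
    # apply_chat_template inserts Llama 3.2's special chat tokens for us
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)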

Step 5 – Generating the Answer

We’ll use the generate API with greedy decoding (do_sample=False) so answers stay deterministic, and cap output at 256 new tokens to avoid runaway responses.

def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,                      # greedy decoding keeps answers reproducible
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

Putting it all together gives us a minimal RAG chatbot function:

def rag_chatbot(question):
    context, _ = retrieve(question, k=5)
    prompt = build_prompt(context, question)
    return generate_answer(prompt)
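A quick test run (the question is just an example; the answer depends on what is in your ./data folder):

print(rag_chatbot("How do I reset my password?"))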

Real‑World Use Cases

Customer Support Knowledge Base: Load FAQs, policy documents, and troubleshooting guides. When a user asks “How do I reset my password?”, the bot pulls the exact steps from the internal policy and cites the document.

Legal Document Assistant: Index contracts, NDAs, and regulatory filings. Lawyers can query “What termination clause applies in contract X?” and receive a concise excerpt with a source reference, reducing time spent scrolling through PDFs.

Product Documentation Chat: Combine API specs, release notes, and code examples. Developers get instant, version‑aware answers like “Which endpoint supports bulk updates in v2.3?” without leaving their IDE.

Pro Tips & Best Practices

Tip 1 – Keep the vector store fresh. Schedule a nightly job that indexes new documents. FAISS supports incremental additions: embed the new chunks, call index.add(new_embeddings), and append the new Document objects to your chunks list so index positions stay aligned.
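A minimal sketch of such an update step, reusing the helpers defined earlier (the file path is hypothetical):

new_docs = load_documents(["./data/new/release_notes.md"])
new_chunks = splitter.split_documents(new_docs)
new_embeddings = embed_chunks(new_chunks)

index.add(new_embeddings)   # append the new vectors to the existing index
chunks.extend(new_chunks)   # keep the id -> Document mapping aligned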

Tip 2 – Use hybrid retrieval. Combine BM25 (keyword) with dense vectors for better recall on technical terms that embeddings may miss.
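A rough sketch of one way to combine the two signals, assuming the rank-bm25 package (pip install rank-bm25) and the chunks, embedder, and index objects from earlier; the 0.5 weighting is an arbitrary starting point:

from rank_bm25 import BM25Okapi
import numpy as np

bm25 = BM25Okapi([c.page_content.lower().split() for c in chunks])

def hybrid_retrieve(query, k=5, alpha=0.5):
    # Dense side: turn L2 distances into similarity-like scores for every chunk
    q_vec = embedder.encode([query], convert_to_numpy=True).astype("float32")
    dists, idxs = index.search(q_vec, len(chunks))
    dense = np.zeros(len(chunks))
    dense[idxs[0]] = 1.0 / (1.0 + dists[0])
    # Sparse side: BM25 keyword scores over the same chunk list
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    if sparse.max() > 0:
        sparse = sparse / sparse.max()
    combined = alpha * (dense / dense.max()) + (1 - alpha) * sparse
    return [chunks[i] for i in np.argsort(combined)[::-1][:k]]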

Tip 3 – Guard against hallucinations. Post‑process the model’s output: verify that every cited source actually contains the claimed fact. A simple regex can extract [source] tags and cross‑check against the metadata.
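A minimal sketch of such a check, matching the [source] convention from the system prompt and the metadata attached to each retrieved chunk:

import re

def unverified_citations(answer, context_chunks):
    known_sources = {str(c.metadata.get("source")) for c in context_chunks}
    cited = re.findall(r"\[([^\]]+)\]", answer)
    # Return any citation that doesn't correspond to a chunk we actually retrieved
    return [c for c in cited if c not in known_sources]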

Advanced: Adding a Reranker for Better Relevance

Sometimes the top‑k FAISS results include noisy chunks. A lightweight cross‑encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) can re‑score the retrieved set before feeding it to the LLM.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")

def rerank(query, candidates, top_n=3):
    scores = reranker.predict([(query, c.page_content) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in ranked[:top_n]]

Integrate the reranker into the retrieval step:

def retrieve_with_rerank(query, k=10, final_k=5):
    initial, _ = retrieve(query, k=k)
    refined = rerank(query, initial, top_n=final_k)
    return refined

Putting It All Together – A Full Example Script

The script below demonstrates a complete end‑to‑end RAG chatbot that you can run locally. It loads documents, builds the index, and starts a simple REPL loop.

import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
import numpy as np

# ---------- Configuration ----------
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
EMBEDDER_NAME = "all-MiniLM-L6-v2"
RERANKER_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
DATA_DIR = "./data"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
TOP_K = 5
TOP_K_RERANK = 10

# ---------- Load LLM ----------
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
llm = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto"
)

# ---------- Load Embedders ----------
embedder = SentenceTransformer(EMBEDDER_NAME, device="cuda")
reranker = CrossEncoder(RERANKER_NAME, device="cuda")

# ---------- Document Ingestion ----------
from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredMarkdownLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_documents(paths):
    docs = []
    for path in paths:
        if path.endswith(".pdf"):
            loader = PyPDFLoader(path)
        elif path.endswith(".md"):
            loader = UnstructuredMarkdownLoader(path)
        else:
            loader = TextLoader(path)
        docs.extend(loader.load())
    return docs

def ingest_folder(folder):
    paths = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith((".pdf", ".md", ".txt"))]
    return load_documents(paths)

raw_docs = ingest_folder(DATA_DIR)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", " "]
)
chunks = splitter.split_documents(raw_docs)

# ---------- Build Vector Store ----------
embeddings = embedder.encode([c.page_content for c in chunks], batch_size=32, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# ---------- Helper Functions ----------
def retrieve(query, k=TOP_K_RERANK):
    q_vec = embedder.encode([query], convert_to_numpy=True).astype("float32")
    dists, idxs = index.search(q_vec, k)
    return [chunks[i] for i in idxs[0]]

def rerank(query, candidates, top_n=TOP_K):
    scores = reranker.predict([(query, c.page_content) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in ranked[:top_n]]

def build_prompt(context_chunks, question):
    context = "\n\n".join([c.page_content for c in context_chunks])
    return f"""[SYSTEM]\n{SYSTEM_PROMPT}\n\n[CONTEXT]\n{context}\n\n[USER]\n{question}"""

def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,                      # greedy decoding keeps answers reproducible
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# ---------- System Prompt ----------
SYSTEM_PROMPT = """You are a helpful assistant. Use the CONTEXT to answer the USER's QUESTION.
Cite sources with [source] tags. If you are unsure, say so."""

# ---------- REPL Loop ----------
print("🦙 RAG Chatbot ready! Type 'exit' to quit.")
while True:
    user_q = input("\n👤 You: ")
    if user_q.lower() in {"exit", "quit"}:
        break
    retrieved = retrieve(user_q)
    refined = rerank(user_q, retrieved)
    prompt = build_prompt(refined, user_q)
    answer = generate_answer(prompt)
    print("\n🤖 Bot:", answer)

This script is deliberately straightforward: it avoids external services, runs entirely on your GPU, and prints source‑cited answers. Feel free to swap out the vector store, embedder, or reranker for larger‑scale deployments.

Scaling Considerations

When you move from a prototype to production, keep an eye on three bottlenecks:

  • Embedding throughput – Batch embeddings and consider a distributed encoder if you have millions of documents.
  • Vector search latency – Switch from IndexFlatL2 to an IVF‑PQ or HNSW index for sub‑millisecond queries on large corpora.
  • Llama inference cost – Use tensor‑parallelism (e.g., DeepSpeed or vLLM) to serve many concurrent chats without exhausting GPU memory.

Monitoring these metrics with Prometheus/Grafana will help you spot performance regressions before users notice.
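For the vector-search bottleneck in particular, switching index types is a small change. A hedged sketch using FAISS’s HNSW index in place of IndexFlatL2 (M=32 and efSearch=64 are common starting points, not tuned values):

import faiss

dimension = embeddings.shape[1]
hnsw = faiss.IndexHNSWFlat(dimension, 32)   # 32 = graph neighbors per node (M)
hnsw.hnsw.efSearch = 64                     # higher means better recall, slower queries
hnsw.add(embeddings)                        # drop-in replacement for the flat index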

Security and Privacy Best Practices

RAG systems often deal with proprietary or regulated data. Follow these guidelines:

  1. Keep the vector store encrypted at rest (FAISS can be wrapped in an encrypted filesystem).
  2. Run Llama 3.2 in an isolated container with limited network egress.