RELEASES Nov. 29, 2025, 11:30 p.m.

Creating RAG Applications with LangChain and Pinecone

Imagine building a smart assistant that doesn't just hallucinate answers but pulls precise information from your own documents. That's the magic of Retrieval-Augmented Generation (RAG). In this guide, we'll dive into creating powerful RAG apps using LangChain for orchestration and Pinecone as our vector database—perfect for developers wanting production-ready retrieval systems.

RAG combines the strengths of large language models (LLMs) with semantic search. You retrieve relevant chunks from a knowledge base before generating responses, ensuring accuracy and context. Let's get hands-on and build something awesome.

What is Retrieval-Augmented Generation (RAG)?

RAG solves a core LLM limitation: lack of up-to-date or domain-specific knowledge. Instead of fine-tuning models, which is costly, RAG fetches relevant data at query time.

The flow is simple: Embed your documents into vectors, store them in a vector DB like Pinecone, retrieve top-k matches for a query, and feed them to an LLM with a prompt like "Answer based on this context."
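
To make that flow concrete, here is a toy sketch in plain Python. It assumes a bag-of-words counter as a stand-in for a real embedding model and an in-memory list instead of Pinecone; the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer based on this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "Password resets are done from the account settings page.",
]
top = retrieve("How do I reset my password?", chunks, k=1)
print(build_prompt("How do I reset my password?", top))
```

A real pipeline swaps `embed` for a learned embedding model and the list for a vector database, but the retrieve-then-prompt shape stays the same.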

Benefits? Grounded responses, easy updates to your knowledge base, and scalability. Your bot answers from your own documents instead of guessing, and it can say so when the documents don't cover a question.

Pro Tip: Always chunk documents smartly. Long texts dilute embeddings; aim for 500-1000 characters per chunk with overlap for context preservation.
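
Here is a minimal sketch of what chunking with overlap means, using fixed-size character windows. It is a simplified stand-in and not how LangChain's splitters work internally (those also try to break on separators such as paragraphs); `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample, chunk_size=1000, overlap=200)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so neighbouring chunks share 200 characters and a sentence cut at a boundary still appears whole in one of them.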

Why LangChain and Pinecone?

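LangChain is a framework that glues everything together: loaders, splitters, embeddings, retrievers, and chains. It's modular, so you can swap components like OpenAI embeddings for HuggingFace without rewriting the rest of the pipeline. Just remember that a different embedding model usually means a different vector dimension, so the index has to be rebuilt.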

Pinecone? A managed vector database with serverless indexing, hybrid search, and global replication. No DevOps headaches; upsert millions of vectors and query in milliseconds.

Together, they shine for RAG: LangChain handles the pipeline, Pinecone the heavy lifting on storage and retrieval.

Setting Up Your Environment

Start with a virtual environment and key libraries. You'll need LangChain, the Pinecone client (pulled in as a dependency of langchain-pinecone), a PDF parser, and an embedding model.

pip install langchain langchain-community langchain-openai langchain-pinecone langchain-text-splitters pypdf openai tiktoken sentence-transformers

Get API keys: OpenAI for embeddings/LLM, Pinecone for the DB. Sign up at pinecone.io and create a free index whose dimension matches your embedding model (e.g., 1536 for OpenAI's text-embedding-ada-002).

Set environment variables:

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["PINECONE_API_KEY"] = "your-pinecone-key"
os.environ["PINECONE_ENV"] = "us-west1-gcp-free"  # Your index environment
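
Hardcoding keys in a script is fine for a quick test, but they tend to leak into version control. As a hedged alternative, the sketch below assumes the keys were exported in your shell beforehand and only reads them; `require_env` is a hypothetical helper, not part of LangChain or Pinecone.

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Typical usage at startup (commented out so the sketch runs without real keys):
# openai_key = require_env("OPENAI_API_KEY")
# pinecone_key = require_env("PINECONE_API_KEY")
```

Failing early with the variable's name is usually easier to debug than an authentication error deep inside a chain.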

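Pro Tip: Use Pinecone's free Starter plan for prototyping. It is generally enough for prototype-sized indexes (on the order of 100k vectors), but plan limits change, so check the current storage and write limits before a bulk upsert to avoid throttling.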

Building Your First RAG Application

Let's create a Q&A system over PDF documents. First, load and split docs using LangChain's tools.

Step 1: Load and Chunk Documents

We'll use PyPDFLoader for PDFs and RecursiveCharacterTextSplitter for chunking.

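import os

# Requires: pip install langchain-community langchain-text-splitters pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone (client v3+; the older pinecone.init() call was removed)
# The client only needs the API key; PINECONE_ENV is not used for serverless indexes
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "rag-index"  # Your index name
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match the embedding model's output size
        metric="cosine",
        # Cloud and region are example values; use the ones your plan supports
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Load PDF (PyPDFLoader returns one Document per page)
loader = PyPDFLoader("your-document.pdf")
docs = loader.load()

# Split into chunks of up to 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)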

This gives us manageable chunks with 20% overlap to keep context flowing.

Step 2: Embed and Store in Pinecone

Embed with OpenAI, then upsert to Pinecone. LangChain's PineconeVectorStore simplifies this.

# Embeddings
embeddings = OpenAIEmbeddings()

# Vector store
vectorstore = PineconeVectorStore.from_documents(
    documents=splits,
    embedding=embeddings,
    index_name=index_name
)

Boom—your knowledge base is indexed! Each chunk gets a unique ID and metadata like source.

Step 3: Retrieve and Generate

Now, query the store and chain to an LLM for answers.

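# RetrievalQA is a legacy chain. This import works on LangChain 0.x;
# newer releases may ship it in a separate legacy package.
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

# LLM (completion-style model; temperature=0 keeps answers deterministic)
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)

# QA Chain: "stuff" pastes all retrieved chunks into a single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query (invoke replaces the deprecated qa_chain({...}) call style)
query = "What is the main topic of the document?"
result = qa_chain.invoke({"query": query})
print(result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])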

Run this, and you'll get a precise answer with cited sources. The retriever fetches top-3 matches via cosine similarity.

Pro Tip: Tune k based on query complexity—too few misses context, too many adds noise. Test with vectorstore.similarity_search_with_score(query) for debugging.
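
One way to act on those scores is to drop weak matches before they reach the prompt. The sketch below assumes results arrive as (item, score) pairs where a higher score means a closer match, which is the case for a cosine index; `keep_relevant` and the 0.75 threshold are illustrative choices, not LangChain API.

```python
def keep_relevant(scored_results, min_score=0.75, max_results=3):
    """Keep at most max_results items whose score meets the threshold, best first."""
    kept = [(item, score) for item, score in scored_results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]

# Made-up scores standing in for real retrieval output
results = [("chunk A", 0.91), ("chunk B", 0.62), ("chunk C", 0.80), ("chunk D", 0.77)]
for item, score in keep_relevant(results, min_score=0.75, max_results=2):
    print(f"{score:.2f}  {item}")
```

The right threshold depends on the embedding model and the data, so treat it as something to tune by inspecting real queries rather than a fixed constant.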

Advanced RAG: Handling Multiple Sources and Metadata Filtering

Real apps handle diverse data: PDFs, web pages, CSVs. Let's enhance with metadata filtering and hybrid search.

Suppose we index multiple docs with metadata like "category: finance" or "date: 2023". Pinecone supports metadata filtering natively.

# Add metadata right after splitting.
# Run this loop BEFORE PineconeVectorStore.from_documents (Step 2) so the
# metadata is upserted together with each vector.
for i, split in enumerate(splits):
    split.metadata["chunk_id"] = i
    split.metadata["category"] = "finance"  # Example

# Filtered retrieval: only chunks whose category equals "finance" are searched
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"category": {"$eq": "finance"}}}
)
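
To see what such a filter expresses, here is a pure-Python sketch that evaluates a small subset of the operator syntax against a metadata dict. It only illustrates the semantics and is not how Pinecone evaluates filters; `matches` is a hypothetical helper, and treating a bare value as `$eq` is a choice made for this sketch.

```python
def matches(metadata, flt):
    """Check a metadata dict against a small subset of Mongo-style operators."""
    ops = {
        "$eq": lambda value, target: value == target,
        "$ne": lambda value, target: value != target,
        "$in": lambda value, target: value in target,
        "$gte": lambda value, target: value is not None and value >= target,
    }
    for field, condition in flt.items():
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # Treat a bare value as shorthand for $eq
        value = metadata.get(field)
        for op, target in condition.items():
            if op not in ops:
                raise ValueError(f"Unsupported operator: {op}")
            if not ops[op](value, target):
                return False
    return True

chunk_meta = {"category": "finance", "year": 2023}
print(matches(chunk_meta, {"category": {"$eq": "finance"}}))
print(matches(chunk_meta, {"year": {"$gte": 2024}}))
```

All conditions in the filter must hold for a chunk to pass, so adding fields narrows the candidate set before similarity ranking happens.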

For multi-source: Use DirectoryLoader or WebBaseLoader.

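from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Recursively load every PDF under docs/, parsing each file with PyPDFLoader
loader = DirectoryLoader("docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()
# Proceed as before: split, embed, and upsert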

This scales to folders of docs. Combine with Pinecone's hybrid search (semantic + keyword) for better precision on rare terms.
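
The idea behind hybrid search can be sketched as a weighted blend of a semantic score and a keyword score. The version below is a simplified illustration that assumes both scores are already normalized to the 0 to 1 range; it is not Pinecone's hybrid API, and `hybrid_rank` and `alpha` are names chosen for this example.

```python
def hybrid_rank(dense_scores, keyword_scores, alpha=0.5):
    """Blend two score dicts (doc_id -> score in [0, 1]); alpha weights the dense side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    doc_ids = set(dense_scores) | set(keyword_scores)
    blended = {
        doc_id: alpha * dense_scores.get(doc_id, 0.0)
        + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)

# Made-up scores: "b" is a weaker semantic match but an exact keyword hit
dense = {"a": 0.9, "b": 0.5}
keyword = {"b": 1.0, "c": 0.5}
print([doc_id for doc_id, _ in hybrid_rank(dense, keyword, alpha=0.5)])
```

With alpha at 1.0 the ranking is purely semantic and at 0.0 purely keyword-based; rare product codes or names usually benefit from a lower alpha.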

Streaming Responses for Better UX

Production tip: Stream outputs to feel responsive.

from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = OpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    temperature=0
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)

Users see tokens arrive live—chatbot vibes!

Pro Tip: Implement re-ranking post-retrieval with libraries like Cohere Rerank. LangChain integrates it via ContextualCompressionRetriever to boost relevance.
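
Re-ranking is a two-stage pattern: fetch a generous candidate set, then reorder it with a second scorer and keep the best few. The sketch below uses plain keyword overlap as a stand-in for a learned reranker such as Cohere Rerank, so it shows the shape of the step rather than its quality; `rerank` is a hypothetical helper.

```python
import re

def _tokens(text):
    """Lowercased word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank(query, candidates, top_n=2):
    """Reorder candidates by the fraction of query terms they contain."""
    query_terms = _tokens(query)

    def overlap(text):
        if not query_terms:
            return 0.0
        return len(query_terms & _tokens(text)) / len(query_terms)

    # sorted() is stable, so ties keep the retriever's original order
    return sorted(candidates, key=overlap, reverse=True)[:top_n]

candidates = [
    "Django auth basics",
    "FastAPI auth with a bearer token",
    "Cooking pasta at home",
]
print(rerank("fastapi auth token", candidates, top_n=2))
```

In practice you would retrieve more candidates than you need (say k=20) and let the reranker cut that down to the handful that go into the prompt.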

Real-World Use Cases

RAG powers everyday apps. Here are battle-tested examples:

Customer Support Chatbots: Index FAQs, tickets, and product docs. Query: "How to reset password?"—instant, accurate help without agent handover.
Legal Research: Embed case laws and contracts. Filter by jurisdiction metadata for compliant advice.
E-Learning Platforms: RAG over course materials. Students ask "Explain neural nets"—gets tailored explanations from lectures.
Internal Knowledge Bases: Like Codeyaan! Index tutorials, retrieve code snippets for "How to implement auth in FastAPI?"
Personalized Recommendations: Embed user reviews and product specs for "Suggest laptops under $1000 with good battery."

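Grounding answers in retrieved text tends to cut hallucination noticeably, which builds trust. How much depends on the dataset and on how it is measured, so treat headline figures such as 70-90% as something to verify on your own data.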

Scaling and Best Practices

For production, use Pinecone namespaces to partition an index (e.g., one namespace per tenant). Upsert in batches:

# Batch upsert for large docs
batch_size = 100
for i in range(0, len(splits), batch_size):
    batch = splits[i:i+batch_size]
    vectorstore.add_documents(batch)
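
Large loads occasionally hit rate limits or transient network errors, so it can help to retry a failed batch instead of restarting the whole job. The sketch below assumes `upsert_fn` is any callable that takes a list (for example `vectorstore.add_documents`); `batched` and `upsert_in_batches` are hypothetical helpers, and retrying on every exception is a simplification.

```python
import time

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def upsert_in_batches(items, upsert_fn, batch_size=100, max_attempts=3, base_delay=1.0):
    """Call upsert_fn on each batch, retrying failed batches with exponential backoff."""
    sent = 0
    for batch in batched(items, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                upsert_fn(batch)
                sent += len(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # Give up after the final attempt
                time.sleep(base_delay * 2 ** (attempt - 1))
    return sent

# Usage with the vector store from earlier (commented out; needs real credentials):
# upsert_in_batches(splits, vectorstore.add_documents, batch_size=100)
```

A production version would catch only the client's rate-limit and timeout errors so that genuine bugs surface immediately.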

Monitor costs: Embeddings are cheap, but high-volume queries add up. Cache frequent queries with Redis.
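
The caching idea looks roughly like this. The sketch keeps answers in a plain dict as a stand-in for Redis and assumes that queries differing only in case or spacing should share an entry; `QueryCache` is a hypothetical class, and `answer_fn` stands for whatever runs your chain.

```python
class QueryCache:
    """In-memory answer cache keyed by a normalized query string."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so trivial variations share a key
        return " ".join(query.lower().split())

    def ask(self, query):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = self._answer_fn(query)
        self._store[key] = answer
        return answer

cache = QueryCache(lambda q: f"answer to: {q}")
cache.ask("What is RAG?")
cache.ask("  what is   rag? ")
print(cache.hits)  # 1
```

With Redis you would also set an expiry on each key so cached answers do not outlive updates to the knowledge base.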

Handle edge cases: Empty results? Fallback to "I need more info." Multi-query? Use MultiQueryRetriever for paraphrasing.
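
The empty-result fallback can be a small guard in front of the LLM call. This sketch assumes `retrieve_fn` returns a list of chunks and `generate_fn` produces the answer; both are placeholders for your retriever and chain, and `answer_with_fallback` is a hypothetical helper.

```python
FALLBACK = "I need more info."

def answer_with_fallback(query, retrieve_fn, generate_fn, min_docs=1):
    """Only call the generator when retrieval returned enough context."""
    docs = retrieve_fn(query)
    if len(docs) < min_docs:
        return FALLBACK  # Skip the LLM call entirely; nothing to ground the answer in
    return generate_fn(query, docs)

# Toy stand-ins for a retriever and an LLM
def toy_retriever(query):
    return ["Resets are done in settings."] if "password" in query else []

def toy_generator(query, docs):
    return f"Based on {len(docs)} chunk(s): {docs[0]}"

print(answer_with_fallback("How do I reset my password?", toy_retriever, toy_generator))
print(answer_with_fallback("What is the weather?", toy_retriever, toy_generator))
```

Skipping generation when nothing relevant was retrieved also saves a model call, and it pairs naturally with a score threshold on the retriever.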

Pro Tip: Evaluate your RAG with RAGAS framework—metrics like faithfulness and context precision guide iterations.

Conclusion

We've built a full RAG app: from chunking docs to querying with context-aware LLMs. LangChain and Pinecone make it seamless, turning static data into dynamic intelligence.

Experiment with your data: tweak chunk sizes, or try local embeddings like all-MiniLM-L6-v2 for privacy (it produces 384-dimensional vectors, so the index dimension must change to match). Deploy to Streamlit or Gradio for instant demos.

Ready to supercharge your apps? Fork this code, index your docs, and watch RAG transform how you interact with information. Happy coding!
