
Building a RAG Pipeline From Scratch With LangChain + Pinecone + Claude: A Real Implementation

DEV Community · Rishabh Sethia · April 1, 2026 · 24 min read


Building a RAG Pipeline From Scratch With LangChain + Pinecone + Claude: A Real Implementation

Most RAG tutorials use a 10-page PDF about Shakespeare and call it a day. You get a working demo in 20 minutes, deploy nothing, and learn the one thing that least resembles production: that RAG is easy.

It isn't. The demo is easy. Production RAG — where your retrieval actually returns the right chunks, your answers are grounded in the source, and the system doesn't hallucinate when it can't find an answer — takes deliberate engineering at every stage of the pipeline.

This is a real implementation guide. We'll build a RAG pipeline using LangChain, Pinecone, and Claude that could actually serve a client product. Every decision explained, every gotcha documented.

What you'll have at the end: A working RAG system that ingests a document corpus, chunks it intelligently, embeds it into Pinecone, retrieves with hybrid search, generates grounded answers with Claude, and evaluates itself.

Prerequisites

  • Python 3.10+

  • Pinecone account (free tier works for development)

  • Anthropic API key

  • OpenAI API key (for embeddings — we'll explain why we use OpenAI for embeddings and Anthropic for generation)

  • ~2 hours

```shell
pip install langchain langchain-anthropic langchain-openai langchain-pinecone \
    pinecone-client pinecone-text python-dotenv pypdf tiktoken
```

Step 1: Document Ingestion and Chunking Strategy

Chunking is where most RAG implementations fail silently. The chunk size question — "should I use 512 tokens or 1,000?" — is the wrong question. The right question is: what is the minimum self-contained unit of meaning in my documents?

For a product FAQ document, that's a single Q&A pair. For a policy document, it's a section. For a knowledge base article, it's a paragraph. Fixed-size token chunking destroys these natural boundaries.
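To make that concrete, here is a minimal sketch of boundary-first splitting for an FAQ document. The `Q:`/`A:` markers and the `split_faq` helper are illustrative assumptions, not part of the pipeline below:

```python
import re

def split_faq(text: str) -> list[str]:
    """Split an FAQ document into self-contained Q&A pairs.

    Each chunk starts at a 'Q:' marker, so a question is never
    separated from its answer by an arbitrary token boundary.
    """
    # The lookahead keeps each 'Q:' marker attached to its own chunk
    parts = re.split(r'\n(?=Q:)', text.strip())
    return [p.strip() for p in parts if p.strip()]

faq = """Q: What is the return window?
A: 30 days from delivery.

Q: Do you ship internationally?
A: Yes, to 40+ countries."""

chunks = split_faq(faq)
# Each chunk is one complete Q&A pair
```

Only after this structural pass would you apply a token limit to any chunk that is still too large.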

We use a two-pass chunking strategy:

Pass 1: Structural splitting — split at document boundaries (headers, sections) first.
Pass 2: Size enforcement — only apply token limits within those structural chunks.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.schema import Document
import re
from typing import List


class SemanticChunker:
    """Chunks documents at semantic boundaries, not arbitrary token counts."""

    def __init__(self, max_chunk_tokens: int = 400, overlap_tokens: int = 50):
        # 400 tokens is our default — not 512.
        # Here's why: at 512 tokens, chunks often end mid-sentence. At 400,
        # there's buffer to complete the thought within the token limit.
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=max_chunk_tokens * 4,  # ~4 chars per token estimate
            chunk_overlap=overlap_tokens * 4,
            separators=["\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""],
            length_function=len,
        )
        self.max_chunk_tokens = max_chunk_tokens

    def chunk_document(self, file_path: str, doc_metadata: dict) -> List[Document]:
        loader = PyPDFLoader(file_path)
        pages = loader.load()

        # Clean up common PDF extraction artifacts
        for page in pages:
            page.page_content = self.clean_text(page.page_content)
            page.metadata.update(doc_metadata)

        # Split into chunks
        chunks = self.splitter.split_documents(pages)

        # Add chunk index for debugging retrieval issues
        for i, chunk in enumerate(chunks):
            chunk.metadata['chunk_index'] = i
            chunk.metadata['chunk_total'] = len(chunks)

        return chunks

    def clean_text(self, text: str) -> str:
        # Remove page headers/footers (common in policy docs)
        text = re.sub(r'Page \d+ of \d+', '', text)
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove lone single characters (OCR artifacts) — single letters
        # other than 'a' and 'I' are almost never real words in English
        text = re.sub(r'(?<=\s)[b-hj-zB-HJ-Z](?=\s)', '', text)
        return text
```

Why 400 tokens and not 512? In our production implementations, 512-token chunks frequently end mid-sentence when the content has long paragraphs. The 400-token limit with 50-token overlap ensures context continuity without cutting thoughts short. Adjust this per your document structure — technical documentation often benefits from 300-token chunks; narrative content from 500.

Step 2: Embedding Model Selection

We use OpenAI text-embedding-3-small for embeddings, even in Claude-based systems. Why not Anthropic embeddings? Anthropic doesn't offer an embedding API. For production English-language applications, text-embedding-3-small provides excellent quality at low cost (~$0.02 per million tokens).

For multilingual use cases (Hindi, Arabic — relevant for our India/GCC client base), we switch to Cohere's embed-multilingual-v3.0.

Critical rule: never mix embedding models. Your query at retrieval time must use the same model as the documents at ingestion time. Mixing models produces semantically inconsistent similarity scores and silent retrieval failures.
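A cheap way to enforce this rule is to record the model name at ingestion time and assert it at query time. A sketch, where the `index_config.json` sidecar file and both helpers are our own convention rather than an SDK feature:

```python
import json
from pathlib import Path

CONFIG_PATH = Path("index_config.json")  # hypothetical sidecar file

def record_embedding_model(model_name: str, dimension: int) -> None:
    """Write the embedding model used at ingestion next to the index."""
    CONFIG_PATH.write_text(json.dumps({"model": model_name, "dimension": dimension}))

def assert_embedding_model(model_name: str) -> None:
    """Fail loudly if the query-time model differs from the ingestion-time model."""
    config = json.loads(CONFIG_PATH.read_text())
    if config["model"] != model_name:
        raise ValueError(
            f"Embedding model mismatch: index built with {config['model']}, "
            f"query uses {model_name}"
        )

record_embedding_model("text-embedding-3-small", 1536)
assert_embedding_model("text-embedding-3-small")  # passes: models match
```

A loud `ValueError` at startup is far cheaper than weeks of debugging silently degraded retrieval.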

```python
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize embedding model
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Initialize Pinecone
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

INDEX_NAME = "rag-knowledge-base"

# Create index if it doesn't exist
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,  # text-embedding-3-small dimension
        # Sparse-dense (hybrid) queries in Step 4 require dotproduct.
        # OpenAI embeddings are unit-normalized, so this ranks like cosine.
        metric="dotproduct",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    print(f"Created Pinecone index: {INDEX_NAME}")

index = pc.Index(INDEX_NAME)
```

Step 3: Ingestion with Metadata Filtering

Metadata in Pinecone is how you scope queries. If your knowledge base has multiple document types — product FAQs, return policies, shipping info — you can filter at query time to only retrieve from the relevant subset.

```python
from langchain_pinecone import PineconeVectorStore
from tqdm import tqdm


def ingest_documents(chunks: List[Document], batch_size: int = 100) -> PineconeVectorStore:
    """Ingest document chunks into Pinecone with progress tracking."""
    print(f"Ingesting {len(chunks)} chunks into Pinecone...")

    vectorstore = PineconeVectorStore(
        index_name=INDEX_NAME,
        embedding=embedding_model,
        pinecone_api_key=os.getenv("PINECONE_API_KEY"),
    )

    # Process in batches to avoid API rate limits
    for i in tqdm(range(0, len(chunks), batch_size), desc="Ingesting"):
        batch = chunks[i:i + batch_size]

        # Ensure all metadata values are Pinecone-compatible types
        # (strings, numbers, booleans — no lists of complex objects)
        for chunk in batch:
            chunk.metadata = {
                k: str(v) if not isinstance(v, (str, int, float, bool)) else v
                for k, v in chunk.metadata.items()
            }

        vectorstore.add_documents(batch)

    print(f"Ingestion complete. Index stats: {index.describe_index_stats()}")
    return vectorstore


vectorstore = ingest_documents(chunks)
```

Step 4: Hybrid Search Retrieval

This is the step that separates production RAG from tutorial RAG. Dense vector search alone has a known weakness: it matches semantic meaning but can miss exact keyword matches. If a user asks "what is the policy for order cancellation within 2 hours" and your document says "2-hour cancellation window," pure semantic search may not rank that chunk highest.

Hybrid search combines dense vectors (semantic) with sparse BM25 (keyword). The alpha parameter controls the blend.
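The blend itself is a convex combination applied to the two query representations. A toy sketch of the weighting (the `hybrid_scale` helper and the vectors are illustrative):

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """Weight dense and sparse query vectors by alpha and (1 - alpha).

    alpha=1.0 -> pure dense (semantic); alpha=0.0 -> pure sparse (keyword).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

dense, sparse = hybrid_scale(
    [0.4, 0.8],
    {"indices": [7, 42], "values": [1.0, 0.5]},
    alpha=0.5,
)
# At alpha=0.5 both signals are halved, so neither dominates the dot product
```

Because the final score is a dot product over both parts, scaling the query vectors this way is equivalent to weighting the two similarity scores.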

```python
from pinecone_text.sparse import BM25Encoder


class HybridRetriever:
    def __init__(self, vectorstore, index, embedding_model, bm25_path: str = None):
        self.vectorstore = vectorstore
        self.index = index
        self.embedding_model = embedding_model

        # Load or initialize BM25
        if bm25_path and os.path.exists(bm25_path):
            self.bm25 = BM25Encoder().load(bm25_path)
        else:
            self.bm25 = BM25Encoder.default()  # Use default params for now

    def fit_bm25(self, corpus: List[str], save_path: str = "bm25_params.json"):
        """Fit BM25 on your document corpus. Do this once during ingestion."""
        self.bm25.fit(corpus)
        self.bm25.dump(save_path)
        print(f"BM25 fitted on {len(corpus)} documents, saved to {save_path}")

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        alpha: float = 0.5,
        metadata_filter: dict = None,
    ) -> List[dict]:
        """
        Hybrid search: alpha=1.0 is pure dense, alpha=0.0 is pure sparse.
        We start at 0.5 and tune based on query type.
        """
        # Dense query vector
        dense_vector = self.embedding_model.embed_query(query)

        # Sparse query vector
        sparse_vector = self.bm25.encode_queries(query)

        # Pinecone has no server-side alpha parameter — apply the
        # convex weighting to the query vectors before sending them
        dense_vector = [v * alpha for v in dense_vector]
        sparse_vector = {
            "indices": sparse_vector["indices"],
            "values": [v * (1 - alpha) for v in sparse_vector["values"]],
        }

        # Pinecone hybrid query
        query_params = {
            "vector": dense_vector,
            "sparse_vector": sparse_vector,
            "top_k": top_k,
            "include_metadata": True,
        }
        if metadata_filter:
            query_params["filter"] = metadata_filter

        results = self.index.query(**query_params)

        return [
            {
                "text": match.metadata.get("text", ""),
                "score": match.score,
                "metadata": match.metadata,
                "id": match.id,
            }
            for match in results.matches
        ]


# Fit BM25 on corpus text (do this once)
corpus_texts = [chunk.page_content for chunk in chunks]
retriever = HybridRetriever(vectorstore, index, embedding_model)
retriever.fit_bm25(corpus_texts, save_path="bm25_params.json")
```

Step 5: The Generation Prompt — Minimising Hallucination

The generation prompt is where most developers underinvest. The default "here is context, answer the question" pattern works for demos. For production, you need explicit grounding instructions and a defined behaviour when the answer isn't in the retrieved context.

```python
from langchain_anthropic import ChatAnthropic
from langchain.schema import HumanMessage, SystemMessage

llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=1024,
    temperature=0.1,  # Low temperature for factual retrieval tasks
)

SYSTEM_PROMPT = """You are a helpful assistant that answers questions based strictly on the provided context.

RULES:
1. ONLY answer based on the context provided. Do not use your general knowledge.
2. If the context does not contain the answer, respond: "I don't have information about that in the knowledge base. Please contact support for this query."
3. If you're partially confident, state what the context says and flag what's uncertain.
4. Always cite which part of the context supports your answer (e.g., "According to the shipping policy section...").
5. Be concise. Answer in 2-4 sentences unless the question requires more detail.

Never fabricate information, dates, prices, or policies."""


def generate_answer(
    query: str,
    retrieved_chunks: List[dict],
    max_context_chunks: int = 4,
) -> dict:
    """Generate a grounded answer using retrieved context."""
    # Limit context to top N chunks to avoid dilution.
    # More chunks ≠ better answers: 3-5 focused chunks outperform 10 scattered ones.
    top_chunks = retrieved_chunks[:max_context_chunks]

    # Format context with source attribution
    context_blocks = []
    for i, chunk in enumerate(top_chunks, 1):
        source = chunk['metadata'].get('source', 'Unknown')
        context_blocks.append(f"[Context {i} — Source: {source}]\n{chunk['text']}")

    context_str = "\n\n".join(context_blocks)

    messages = [
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=f"CONTEXT:\n{context_str}\n\nQUESTION: {query}"),
    ]

    response = llm.invoke(messages)

    return {
        "answer": response.content,
        "sources": [c['metadata'] for c in top_chunks],
        "retrieval_scores": [c['score'] for c in top_chunks],
    }
```

Step 6: Evaluation — How Do You Know If Your RAG Is Working?

This is the step 80% of RAG builders skip entirely. A RAG system without evaluation is a black box. You can't improve what you can't measure.

Three metrics we track on every client RAG project:

  1. Retrieval Recall@k — Does the relevant document appear in the top k results?
  2. Answer Faithfulness — Is the answer supported by the retrieved context? (Detects hallucination)
  3. Answer Relevance — Does the answer actually address the question?
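Recall@k needs nothing beyond labeled (query, relevant chunk id) pairs. A minimal scorer, assuming your test set records the id of the chunk that should be retrieved:

```python
def recall_at_k(results: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant chunk appears in the top k.

    results[i] is the ranked list of retrieved ids for query i;
    relevant_ids[i] is the id of the chunk that should be found.
    """
    hits = sum(
        1 for retrieved, rel in zip(results, relevant_ids)
        if rel in retrieved[:k]
    )
    return hits / len(relevant_ids)

# Two queries: the first finds its chunk at rank 2, the second misses entirely
score = recall_at_k(
    [["c9", "c4", "c1"], ["c2", "c3", "c8"]],
    ["c4", "c7"],
    k=3,
)
# → 0.5
```

Faithfulness and relevance need an LLM judge (below); retrieval recall is pure bookkeeping, which is exactly why there is no excuse to skip it.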

```python
from anthropic import Anthropic
import json

client = Anthropic()


def evaluate_faithfulness(question: str, answer: str, context: str) -> dict:
    """
    Ask Claude to judge whether the answer is supported by the context.
    This is the LLM-as-judge pattern — imperfect but scalable.
    """
    eval_prompt = f"""You are evaluating whether an AI answer is faithful to the provided context.

CONTEXT: {context}

QUESTION: {question}

ANSWER: {answer}

Evaluate on a scale of 1-5:
- 5: Fully supported by context, no unsupported claims
- 3: Mostly supported, minor unsupported details
- 1: Contains claims not in context (hallucination)

Return ONLY a JSON object: {{"score": <1-5>, "reason": "", "hallucinated_claims": [""]}}"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{"role": "user", "content": eval_prompt}],
    )

    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"score": None, "error": "parse_failed", "raw": response.content[0].text}


def run_evaluation_suite(test_cases: List[dict], retriever: HybridRetriever) -> dict:
    """Run evaluation on a test set. Build this before shipping to production."""
    results = []
    for test in test_cases:
        retrieved = retriever.retrieve(test['question'], top_k=5)
        answer_result = generate_answer(test['question'], retrieved)

        context_str = "\n".join([c['text'] for c in retrieved[:4]])
        faithfulness = evaluate_faithfulness(
            test['question'], answer_result['answer'], context_str
        )

        results.append({
            "question": test['question'],
            "expected": test.get('expected_answer'),
            "actual": answer_result['answer'],
            "top_retrieval_score": retrieved[0]['score'] if retrieved else 0,
            "faithfulness_score": faithfulness.get('score'),
            "hallucinated_claims": faithfulness.get('hallucinated_claims', []),
        })

    # Average only over cases the judge actually scored
    scored = [r for r in results if r['faithfulness_score']]
    avg_faithfulness = sum(r['faithfulness_score'] for r in scored) / max(len(scored), 1)
    avg_retrieval = sum(r['top_retrieval_score'] for r in results) / len(results)

    return {
        "total_tests": len(results),
        "avg_faithfulness": round(avg_faithfulness, 2),
        "avg_retrieval_score": round(avg_retrieval, 3),
        "cases": results,
    }
```

The One Mistake That Causes 80% of RAG Failures

After building RAG pipelines across multiple client projects, the failure that appears most often isn't chunking, embedding choice, or prompt design. It's this: developers blame the LLM when the retrieval is broken.

The symptoms look like the model is hallucinating or not following instructions. The actual problem is that the wrong chunks are being retrieved — the LLM is doing its best with bad context and producing a bad answer. You can spend weeks tuning your generation prompt while the retrieval is returning irrelevant chunks and nothing will improve.

Before blaming generation, always check retrieval first:

  • Run your test queries and print the retrieved chunks

  • Ask: are these chunks actually relevant to the question?

  • If no: fix chunking, improve metadata filtering, tune alpha

  • If yes but answers are still wrong: now look at the generation prompt

This separation of concerns — retrieval quality as an independent metric from generation quality — is the mindset shift that makes RAG systems actually work.
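The checklist above is easy to script. A sketch of a retrieval-only debug pass; the retriever here is a stub so the harness stands alone, and in practice you would pass a thin wrapper around your real retriever:

```python
def debug_retrieval(retrieve_fn, queries: list[str], top_k: int = 5) -> list[dict]:
    """Run retrieval only — no generation — and collect what came back.

    Eyeballing these chunks answers the key question first:
    is retrieval broken, or is generation?
    """
    report = []
    for q in queries:
        chunks = retrieve_fn(q, top_k)
        report.append({
            "query": q,
            "top_score": chunks[0]["score"] if chunks else None,
            "chunks": [c["text"][:120] for c in chunks],  # preview only
        })
    return report

# Stub retriever for illustration — swap in your real retriever in practice
def fake_retrieve(query, top_k):
    return [{"score": 0.82, "text": f"chunk matching '{query}'"}]

report = debug_retrieval(fake_retrieve, ["return window for damaged items"])
```

Reviewing this report before touching the prompt keeps retrieval quality and generation quality as separate, independently measurable concerns.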

Full Pipeline: Putting It Together

```python
class RAGPipeline:
    def __init__(self, index_name: str, alpha: float = 0.5):
        self.chunker = SemanticChunker()
        self.retriever = None  # Initialized after ingestion
        self.index_name = index_name
        self.alpha = alpha

    def ingest(self, file_paths: List[str], doc_metadata_list: List[dict]):
        all_chunks = []
        for path, metadata in zip(file_paths, doc_metadata_list):
            chunks = self.chunker.chunk_document(path, metadata)
            all_chunks.extend(chunks)

        vectorstore = ingest_documents(all_chunks)
        self.retriever = HybridRetriever(vectorstore, index, embedding_model)
        self.retriever.fit_bm25([c.page_content for c in all_chunks])
        print(f"Pipeline ready. {len(all_chunks)} chunks indexed.")

    def query(self, question: str, metadata_filter: dict = None) -> dict:
        if not self.retriever:
            raise ValueError("Pipeline not initialized. Call ingest() first.")

        retrieved = self.retriever.retrieve(
            question,
            top_k=5,
            alpha=self.alpha,
            metadata_filter=metadata_filter,
        )
        return generate_answer(question, retrieved)


# Usage
pipeline = RAGPipeline(index_name="rag-knowledge-base")
pipeline.ingest(
    file_paths=["help_center.pdf", "return_policy.pdf", "shipping_guide.pdf"],
    doc_metadata_list=[
        {"doc_type": "help_center"},
        {"doc_type": "return_policy"},
        {"doc_type": "shipping"},
    ],
)

result = pipeline.query(
    "What is the return window for damaged items?",
    metadata_filter={"doc_type": "return_policy"},
)
print(result['answer'])
```

What This Costs in Production

For a knowledge base of ~500 pages serving 1,000 queries/day:

  • Pinecone serverless: ~$5-15/month

  • OpenAI embeddings (ingestion, one-time): ~$0.50 for 500 pages

  • Claude Sonnet API (generation, 1,000 queries/day): ~$15-30/month

  • Total: ~$20-45/month for a production RAG system

This is a core deliverable in our AI automation services. We've built RAG pipelines as part of support automation, internal knowledge management, and product recommendation systems. The architecture above is battle-tested across production deployments — not a tutorial construct.

If you're evaluating whether RAG is the right architecture for your project, see how we approach AI app design or read the architectural comparison between RAG, fine-tuning, and context stuffing.

Frequently Asked Questions

What's the difference between this and just using a ChatPDF-style tool? ChatPDF and similar tools are black boxes — you can't control chunking, retrieval logic, filtering, or evaluation. A custom pipeline gives you full control over every decision: chunk size, embedding model, retrieval alpha, metadata filtering, grounding instructions, and output format. For a client product, that control is not optional.

Can I use this with a local LLM instead of Claude? Yes. Replace ChatAnthropic with ChatOllama or any LangChain-compatible LLM. For the evaluator in Step 6, you need a capable model — local 7B models often produce unreliable faithfulness scores. We recommend keeping Claude for evaluation even if you switch the generation model.

Why use LangChain at all? Could I build this without it? You can. LangChain adds abstraction overhead. For a simple pipeline, raw Anthropic + Pinecone SDK is cleaner. LangChain earns its place when you need LCEL chains, callbacks for logging, or multiple retrieval strategies in one pipeline. Use it if you need its features; skip it for simpler implementations.

How do I handle documents that update frequently? Don't re-ingest the entire corpus. Use Pinecone's delete + upsert with a stable document ID scheme. When a document updates, delete its chunks by ID filter and re-ingest. Tag every chunk with doc_version in metadata so you can audit which version answered which query.
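A sketch of such an id scheme. The `doc_id#chunk_index` format is one common convention (not the only one), and the actual Pinecone list/delete calls are indicated in a comment to keep the example self-contained:

```python
def chunk_ids(doc_id: str, n_chunks: int) -> list[str]:
    """Stable, prefix-addressable ids: '<doc_id>#<chunk_index>'.

    A stable scheme lets an update delete exactly one document's
    chunks (by prefix) and re-upsert the new version in place.
    """
    return [f"{doc_id}#{i}" for i in range(n_chunks)]

def ids_to_delete(existing_ids: list[str], doc_id: str) -> list[str]:
    """Select every chunk id belonging to one document."""
    prefix = f"{doc_id}#"
    return [i for i in existing_ids if i.startswith(prefix)]

ids = chunk_ids("return_policy_v2", 3)
# → ['return_policy_v2#0', 'return_policy_v2#1', 'return_policy_v2#2']

# In practice, against a serverless index:
#   for page in index.list(prefix="return_policy_v2#"):
#       index.delete(ids=page)
# ...then re-ingest the new chunks under the same prefix.
```

The `#` separator matters: filtering on the prefix `"a#"` never accidentally matches a different document whose id merely starts with `"a"`.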

What chunk size should I use for my documents? Test it. Generate 5-10 representative test queries, run retrieval at chunk sizes of 200, 400, 600 tokens, and measure recall@5 for each. The chunk size that returns the relevant document in the top 5 most often is the right size for your corpus. There is no universal answer — anyone who says otherwise hasn't built production RAG.
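That experiment is a short loop. A sketch of the harness, with retrieval stubbed out so it stands alone; in practice `build_and_retrieve` would re-chunk, re-ingest, and query the index at each candidate size:

```python
def sweep_chunk_sizes(sizes, test_cases, build_and_retrieve, k: int = 5) -> dict:
    """Measure recall@k for each candidate chunk size.

    test_cases: list of (query, relevant_doc_id) pairs.
    build_and_retrieve(size, query, k) -> ranked list of doc ids.
    """
    scores = {}
    for size in sizes:
        hits = sum(
            1 for query, relevant in test_cases
            if relevant in build_and_retrieve(size, query, k)
        )
        scores[size] = hits / len(test_cases)
    return scores

# Stub: pretend 400-token chunks retrieve best for this corpus
def stub(size, query, k):
    return ["policy_doc"] if size == 400 else ["other_doc"]

best = sweep_chunk_sizes([200, 400, 600], [("return window?", "policy_doc")], stub)
# The size with the highest recall wins for this corpus
```

Re-ingesting per size costs a few cents in embeddings for a modest corpus; picking the wrong chunk size costs far more in retrieval misses.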

How do I prevent the RAG from making up information when the answer isn't in the knowledge base? The system prompt in Step 5 handles this: the model is instructed to respond with a defined fallback rather than generating from its general knowledge. Test this explicitly by asking questions you know aren't in the corpus. If the model answers them confidently, tighten the grounding instruction or reduce the temperature.

Rishabh Sethia is Founder & CEO of Innovatrix Infotech. Former SSE / Head of Engineering. DPIIT Recognized Startup. Shopify Partner. AWS Partner.
