Your AI Chatbot Isn't Stupid. It Just Has No Memory. Here's How We Fixed That.
I had a moment in a session a few weeks ago that I haven't stopped thinking about. Someone asked an AI chatbot what their company's refund policy was. The bot answered confidently, fluently, with zero hesitation. It was also completely wrong. It had invented a policy (14 days, original packaging, contact support@) from thin air, because it had never actually seen the company's documentation.

It wasn't broken. It was doing exactly what it was designed to do: predict the most plausible-sounding next word. And "most plausible" and "accurate" are not the same thing.

That's the dirty secret of LLMs fresh out of training. They're brilliant at sounding right. They're not inherently good at being right, especially about things that aren't in their training data.

The fix has a name: RAG, or Retrieval-Augmented Generation. It's one of the most widely deployed AI architectures in enterprise software right now, and once you understand how it works, you'll see it everywhere.
First, understand the actual problem

An LLM is trained on a snapshot of the internet up to some date. After that, it's frozen. It doesn't know what happened yesterday. It doesn't know your company's internal docs. It doesn't know the policy your team updated last Tuesday.

When you ask it something it doesn't know, it doesn't say "I don't know." It says whatever sounds most likely based on patterns it absorbed during training. That's hallucination: not a bug, just the nature of next-token prediction without grounding.

The naive solution is: just paste all your documents into the prompt. That breaks immediately. Context windows are finite. You can't dump 10,000 internal documents into every request. And even if you could, the model would have trouble focusing on what's actually relevant.

So the real solution is: don't give it everything, give it the right thing at the right moment. That's RAG.
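To put rough numbers on why "paste everything" breaks, here is a back-of-the-envelope sketch. The document size and context window below are hypothetical values picked for illustration, not measurements of any particular model or corpus, and `fits_in_context` is a made-up helper name.

```python
def fits_in_context(num_docs: int, tokens_per_doc: int, context_window: int) -> bool:
    """Return True if all documents could be pasted into a single prompt."""
    return num_docs * tokens_per_doc <= context_window


# Hypothetical numbers, chosen only for illustration.
NUM_DOCS = 10_000
TOKENS_PER_DOC = 500
CONTEXT_WINDOW = 128_000

total_tokens = NUM_DOCS * TOKENS_PER_DOC
print(total_tokens)  # 5000000
print(fits_in_context(NUM_DOCS, TOKENS_PER_DOC, CONTEXT_WINDOW))  # False
print(fits_in_context(3, TOKENS_PER_DOC, CONTEXT_WINDOW))  # True
```

Under these assumptions the full corpus is dozens of times larger than the window, while a handful of retrieved chunks fits with room to spare.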
What RAG actually does (step by step)

Think of it like this. You have a researcher and a librarian working together.

The librarian manages a massive archive of your documents: your policies, your product docs, your internal wikis, whatever you've ingested. When a question comes in, the librarian finds the most relevant pages and hands them over.

The researcher (the LLM) reads those pages and writes the answer. They don't need to have memorized the entire library. They just need the right sources on their desk.

Here's the pipeline, made concrete:
Step 1: Ingest

You take your documents and chunk them, breaking them into smaller pieces, typically 300–500 words each. Why chunk? Because if you store a 50-page employee handbook as one blob, and someone asks about PTO policy, you'd retrieve all 50 pages and waste your entire context window on irrelevant sections.

Each chunk gets converted into an embedding: a list of numbers (commonly 384 or 768 of them, depending on the embedding model) that captures its meaning in vector space. Similar meanings cluster together. Words like "refund," "return," and "money back" end up near each other even though they're different strings.

All these embeddings get stored in a vector database, for example Chroma if you're prototyping or Pinecone if you're in production.
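The chunking part of ingest can be sketched in plain Python. This is a minimal word-window splitter, not a recommendation: the function name `chunk_words`, the 400-word size, and the 50-word overlap are all assumptions of mine. The overlap is a common trick so that a sentence cut at a boundary still appears whole in one of the neighbouring chunks.

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words, with overlapping edges."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A stand-in for a long document: 1,000 placeholder words.
handbook = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_words(handbook)
print(len(chunks))  # 3
```

Each string in `chunks` is what would then be passed to the embedding model and stored.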
Step 2: Retrieve

User asks: "Can I get my money back?"

That question gets converted into an embedding using the same model. Then the system searches the vector database for chunks whose embeddings are closest to the question's embedding.

This is the part that trips people up: there are zero overlapping keywords between "Can I get my money back?" and "Our refund policy allows returns within 30 days." But semantically, they're saying the same thing. Semantic search finds it anyway.

```python
# `model` and `index` are the embedding model and vector index built during ingest
query = "Can I get my money back?"
query_vector = model.encode([query])

distances, indices = index.search(query_vector, k=2)
# Returns: doc about refund policy (distance: 0.85)
# NOT: doc about password resets (distance: 1.82)
```
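If the search call feels like a black box, the idea behind it is just "measure the distance to every stored vector and keep the closest few." Here is a dependency-free sketch of that idea. The 3-dimensional vectors are made-up toy values, not real embeddings, and `l2_distance` and `search` are hypothetical helper names. A flat FAISS index does the same exhaustive comparison, only much faster, and `IndexFlatL2` reports squared distances rather than the square-rooted ones used here.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_vec, vectors, k=2):
    """Return (distance, index) pairs for the k stored vectors nearest the query."""
    scored = sorted((l2_distance(query_vec, v), i) for i, v in enumerate(vectors))
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
toy_vectors = [
    [0.9, 0.1, 0.0],  # stands in for the refund policy chunk
    [0.1, 0.9, 0.0],  # stands in for the pricing chunk
    [0.0, 0.1, 0.9],  # stands in for the password reset chunk
]
toy_query = [0.8, 0.2, 0.1]  # stands in for "Can I get my money back?"

results = search(toy_query, toy_vectors, k=2)
nearest_index = results[0][1]
print(nearest_index)  # 0
```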
Step 3: Augment

The retrieved chunks get injected into the prompt alongside the user's question:

```
SYSTEM: You are a helpful customer support agent. Answer using ONLY the provided context. If the answer isn't there, say so.

CONTEXT: "Our refund policy allows returns within 30 days of purchase. Items must be in original packaging. Digital products are non-refundable after download."

USER: Can I get my money back?
```

Step 4: Generate

The LLM answers, but now it's grounded. It's not predicting from vibes. It's reading actual documentation and summarizing it:
"Yes, you can get a refund within 30 days of purchase, as long as the item is in its original packaging. Note that digital products can't be refunded after download. Want me to help you start a return?"
Accurate. Specific. Citable.
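The augment step is mostly string assembly, so it's easy to sketch. The function below rebuilds the prompt layout from Step 3. The name `build_prompt`, the `[1]`, `[2]` numbering of chunks, and the placeholder text for an empty result are my own assumptions, not part of any library.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from the user's question and retrieved chunks."""
    # Number each chunk so the model (and a human reviewer) can point at a source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    if not context:
        context = "(no relevant documents found)"
    return (
        "SYSTEM: You are a helpful customer support agent. "
        "Answer using ONLY the provided context. "
        "If the answer isn't there, say so.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"USER: {question}"
    )


prompt = build_prompt(
    "Can I get my money back?",
    ["Our refund policy allows returns within 30 days of purchase."],
)
print(prompt)
```

The resulting string is what gets sent to whichever LLM API you use for the generate step.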
Why this matters more than people realize

Without RAG, the same bot would have said something like "Most companies offer 14-day return windows": plausible, confident, wrong. The difference isn't the model. It's the context you give it.

This is the pattern behind almost every enterprise AI product that actually works. Perplexity does it with the internet in real-time. GitHub Copilot does it with your codebase. Customer support bots do it with your knowledge base. The underlying model is often identical across these products. What differs is what gets retrieved and injected into the prompt.

Here's the full working implementation, with no frameworks, just the raw four-step pipeline in about 30 lines of Python:

```python
# Requires: pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# STEP 1: INGEST
docs = [
    "Our refund policy allows returns within 30 days.",
    "Premium plan costs $29/month with unlimited API calls.",
    "To reset password: Settings > Security > Change Password.",
    "AI features use GPT-4 for text and DALL-E for images.",
]

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs)

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings, dtype='float32'))

# STEP 2: RETRIEVE
query = "Can I get my money back?"
query_vector = model.encode([query])
distances, indices = index.search(np.array(query_vector, dtype='float32'), k=2)

# STEP 3: AUGMENT
retrieved = [docs[i] for i in indices[0]]
prompt = f"""Based on:
{chr(10).join(retrieved)}

Answer: {query}"""

# STEP 4: GENERATE
# Send prompt to OpenAI/Anthropic/etc.
print(prompt)
```

That's it. Most production RAG systems, from chatbots to research assistants, are this same pattern, scaled.
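One gap worth knowing about: a nearest-neighbour search hands back the k closest chunks even when none of them is actually relevant. Below is a minimal sketch of a cutoff that drops weak matches so the bot can fall back to "I don't know." The helper name `select_relevant` and the `max_distance` value of 1.0 are hypothetical, and a real threshold would need tuning on your own data. The example distances reuse the illustrative 0.85 and 1.82 from Step 2.

```python
def select_relevant(docs, distances, indices, max_distance=1.0):
    """Keep only retrieved docs whose distance is at or below max_distance."""
    # FAISS pads missing results with index -1, so skip those as well.
    return [docs[i] for d, i in zip(distances, indices) if i >= 0 and d <= max_distance]


example_docs = [
    "Our refund policy allows returns within 30 days.",
    "Premium plan costs $29/month with unlimited API calls.",
    "To reset password: Settings > Security > Change Password.",
]

kept = select_relevant(example_docs, distances=[0.85, 1.82], indices=[0, 2])
print(kept)  # ['Our refund policy allows returns within 30 days.']

if not kept:
    print("I couldn't find that in the documentation.")
```

With the pipeline's variables, the call would look like `select_relevant(docs, distances[0], indices[0])`, since the search returns one row of results per query.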
The honest limitations

RAG isn't magic. It fails in predictable ways:

Chunking matters more than you think. If you chunk carelessly, splitting mid-sentence or making chunks too large, retrieval quality tanks. The model can only answer from what it retrieves, and it can only retrieve what's in the chunks.

Garbage in, garbage out. If your documentation is inconsistent, outdated, or contradictory, the bot will faithfully reflect that chaos. RAG doesn't fix bad source material.

Retrieval isn't always enough. Some questions need synthesis across multiple documents, not just retrieval of one chunk. That's where more sophisticated pipelines (re-ranking, multi-hop retrieval, agentic approaches) come in.
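On the chunking point, one way to avoid splitting mid-sentence is to pack whole sentences into a chunk until a word budget is reached. Treat the following as a rough sketch under stated assumptions: `chunk_sentences` is a hypothetical helper, and the regex is a naive splitter that will stumble on abbreviations like "e.g." or on decimal numbers followed by a space.

```python
import re


def chunk_sentences(text: str, max_words: int = 400) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words where possible."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # budget exceeded: close this chunk
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


policy = (
    "Refunds are allowed within 30 days. "
    "Items must be in original packaging. "
    "Digital products are non-refundable after download."
)
policy_chunks = chunk_sentences(policy, max_words=12)
print(len(policy_chunks))  # 2
```

A single sentence longer than `max_words` still ends up as its own oversized chunk here, so very long sentences would need separate handling.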
The mental model to carry forward

The LLM is the researcher. The vector database is the library. RAG is the system that ensures the researcher always has the right books open before they start writing.

Without it, you have a very articulate person answering confidently from memory alone, and memory, as we know, is unreliable. With it, you have the same person, but now they're actually reading the source material.

That's the difference between an AI that sounds good and an AI that's actually useful.
Building something with RAG? Drop your setup in the comments — curious what stacks people are running in production.