Building a Fully Local RAG System with Qdrant and Ollama
Some months ago I was working on a custom solution and I needed to add RAG to it. The requirements were simple but not flexible: everything had to run locally, and it had to be deployable in Docker alongside the rest of the services. After looking at some options, I chose Qdrant, and after experimenting with it I can say it was a good decision.
I know there are more complete solutions to add RAG to a local LLM setup. Frameworks like LangChain or LlamaIndex already abstract most of what I will describe here. But my requirements were not complex, and I did not want to add more dependencies and abstractions on top of a stack I already understand. Keeping things explicit made more sense for this project.
This article explains what I learned. It is not a deep technical guide; it is more a conceptual explanation for developers who want to understand how Qdrant and Ollama work together before they start coding.
Why Run Everything Locally?
My client did not want documents leaving their network, so I did not have much to think about. But even before I started the project, I was already curious about local LLMs. I wanted to understand how far you can go without depending on external services.
The answer is: pretty far. The models available through Ollama are good enough for most practical use cases, and tools like Qdrant make the infrastructure side simple. The cost of "running local" is much lower than I expected, both in setup time and in hardware requirements.
The tradeoff is real though. A local 7B model is not going to perform like GPT-4. For this project that was fine, because the task is retrieval and summarization, not complex reasoning. The model just needs to read some context and write a coherent answer, and for that, smaller models work well.
What is RAG?
RAG is not a new idea; it has been used for a while and is by now a well-known pattern. But it is very useful for this type of use case, and it is worth understanding how it works before you start connecting the tools.
A standard LLM only knows what it learned during training, and it can only answer questions from that knowledge. If you ask it about your internal documents, your company wiki, or a PDF you have, it has no idea about that content.
RAG solves this by adding a retrieval step before the model generates an answer: it searches your documents, finds the relevant parts, and gives them to the model as context. The model then uses that context to write the answer, so the response is grounded in your real data rather than only in what the model learned during training. This also reduces hallucinations significantly.
The steps are:

1. Index your documents - split them into small pieces, convert each piece into a vector (a numerical representation of its meaning), and store those vectors in a vector database.
2. Receive a question - convert the question into a vector using the same embedding model.
3. Search - find the stored pieces whose vectors are most similar to the question vector.
4. Build the prompt - put the found pieces as context before the question and pass everything to the LLM.
5. Generate the answer - the model reads the context and responds based on it.
```
User Question
      │
      ▼
[Embedding Model] ──► Question Vector
      │
      ▼
[Qdrant] ── similarity search ──► Top-k Chunks
      │
      ▼
Prompt = Context + Question
      │
      ▼
[Ollama LLM]
      │
      ▼
Answer
```
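The "build the prompt" step in the flow above can be sketched as a small pure function. This is a hedged illustration: the function name, the prompt template, and the example chunks are my own, not part of Qdrant or Ollama.

```python
def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble retrieved chunks and the user question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

# Example with two hypothetical retrieved chunks:
prompt = build_prompt(
    ["Qdrant stores vectors.", "Ollama runs models locally."],
    "What does Qdrant do?",
)
print(prompt)
```

The exact template matters less than being consistent: the model should always see the context clearly separated from the question.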
The Stack
Qdrant - The Vector Database
Qdrant is an open source vector database built for storing and searching vectors efficiently. In a RAG pipeline it works as the memory of the system: you push the document pieces into it during indexing, and when a question comes it finds the most relevant ones in milliseconds.
What I liked about it is how little friction there is to start. It runs as a single Docker container with no extra configuration, and its REST API is clean enough to use directly without a framework on top. Each stored item can also carry metadata alongside the vector, so you can filter results by things like document type, date, or source, which is useful when you have documents from different contexts in the same collection.
```
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
It also comes with a web dashboard at http://localhost:6333/dashboard where you can browse your collections and inspect stored points, which is very useful when you are debugging why a particular chunk is or is not being retrieved.
Ollama - Local LLM Runtime
Ollama is the runtime that makes running language models locally feel simple. It handles model downloads, quantization, and serving, and you interact with it through a CLI or a local HTTP API that has a similar format to the OpenAI API, so most existing tools work with minimal changes.
For this RAG setup, Ollama does two things: it runs the embedding model that converts text into vectors, and the generation model that synthesizes the final answer. Having both in the same runtime keeps the stack simple: one service, one API, and no separate embedding server to manage.
Install it from ollama.com and pull the models:
```
ollama pull llama3.2          # generation model
ollama pull nomic-embed-text  # embedding model
```
How They Work Together
The indexing phase happens once, or when your documents change. You start by loading your documents. In my case this was PDFs, text files, and also some MP4 files whose audio I transcribed to text before indexing. Once you have plain text, Qdrant does not care about the original format. You then split the text into overlapping chunks, typically around 512 tokens with some overlap so context is not lost at the boundaries. For each chunk, you call Ollama's embedding API to get a vector (for example 768 dimensions with nomic-embed-text) and save that vector together with the original text and any metadata into a Qdrant collection.
The query phase runs for every user question. You convert the question to a vector using the same Ollama embedding model, pass that vector to Qdrant's search API, and get back the most similar chunks. You then build a prompt by putting those chunks as context before the question, send it to the Ollama generation model, and return the answer to the user.
One important thing to understand: you must use the same embedding model for indexing and for queries, because the vector space it creates only makes sense if both document chunks and questions are embedded in the same space. If you change the model, you need to re-index everything.
Key Things to Know
Chunking
How you split the documents affects the quality of the results more than most people expect. Chunks that are too big bring too much irrelevant text and reduce retrieval precision. Chunks that are too small lose the context needed to answer the question well.
A good starting point is chunks of 512 tokens with 64 tokens of overlap. The overlap makes sure that a sentence split across a boundary is not lost entirely. For structured documents like FAQs or product specs, splitting by logical section usually works better than splitting by character count.
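A minimal overlapping chunker can look like the sketch below. Note the hedge: it counts whitespace-separated words as a rough stand-in for tokens; a real pipeline would use the tokenizer of the embedding model, and the function name is my own.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks.

    Words approximate tokens here; swap in the embedding model's
    tokenizer for exact budgets.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1200-word document yields three chunks of up to 512 words,
# each sharing 64 words with the previous one.
doc = " ".join(f"w{i}" for i in range(1200))
print(len(chunk_text(doc)))  # 3
```

Each chunk starts 448 words after the previous one, so the last 64 words of one chunk are repeated at the start of the next, which is what keeps boundary sentences retrievable.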
Embedding Model
For a local setup with Ollama, these are the common options:
| Model | Dimensions | Notes |
| --- | --- | --- |
| nomic-embed-text | 768 | Fast, good for general English |
| mxbai-embed-large | 1024 | Better quality, needs more resources |
| nomic-embed-text-v1.5 | 768 | Supports flexible dimension reduction |
I used nomic-embed-text, not because I did a detailed comparison, but because I had already used it some months earlier while learning RAG from a tutorial. It worked well then, and there was no reason to change. Sometimes the familiar option is good enough.
Collections in Qdrant
A collection in Qdrant is similar to a table in a relational database. When you create one you declare the vector size and the distance metric (cosine similarity is the standard for text embeddings):
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```
Filtering by Metadata
One of the most useful features of Qdrant for RAG is the ability to filter search results by the metadata you attach to each vector. If you are building a system where different users have their own documents, you can tag each vector with a user_id and filter the search so users only retrieve their own content, without needing a separate collection for each user:
```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="docs",
    query_vector=question_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="user_id", match=MatchValue(value="alice"))]
    ),
    limit=5,
)
```
A Simple Example
Here is the basic flow in Python, no framework, just the minimum to make it work end to end:
```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA_BASE = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"

qdrant = QdrantClient("http://localhost:6333")

# 1. Create collection
qdrant.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# 2. Helper to embed text
def embed(text: str) -> list[float]:
    resp = requests.post(
        f"{OLLAMA_BASE}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text},
    )
    return resp.json()["embedding"]

# 3. Index documents
documents = [
    "Qdrant is a vector database written in Rust, designed for fast nearest-neighbor search.",
    "Ollama lets you run large language models locally with a simple CLI and REST API.",
    "RAG combines information retrieval with text generation to ground LLM answers in real data.",
]

points = [
    PointStruct(id=i, vector=embed(doc), payload={"text": doc})
    for i, doc in enumerate(documents)
]

qdrant.upsert(collection_name="docs", points=points)

# 4. Search
question = "What database should I use for semantic search?"

hits = qdrant.search(
    collection_name="docs",
    query_vector=embed(question),
    limit=2,
)

context = "\n\n".join(hit.payload["text"] for hit in hits)

prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
"""

# 5. Generate answer
response = requests.post(
    f"{OLLAMA_BASE}/api/generate",
    json={"model": CHAT_MODEL, "prompt": prompt, "stream": False},
)

print(response.json()["response"])
```
In a real project you would add proper document loading (PyMuPDF for PDFs, python-docx for Word files), better chunking logic, error handling, and a web API layer, but the core logic is exactly this.
Things to Be Careful About
The most important thing is to not change the embedding model after you already indexed your documents. The vectors from different models are not compatible, so if you switch models everything in Qdrant becomes useless and you need to re-index from the beginning. It is a good habit to keep the model name in your configuration and treat it like part of your data schema.
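One simple way to enforce this is a fail-fast check at startup: record the embedding config when you index, and refuse to query if the running config no longer matches. This is a sketch with hypothetical names, not a Qdrant feature; where you persist the stored config (a JSON file, the collection's metadata) is up to you.

```python
# Config the application is currently running with.
CURRENT = {"embed_model": "nomic-embed-text", "dimensions": 768}

def check_embedding_config(stored: dict, current: dict) -> None:
    """Raise if the running embedding config differs from what was indexed."""
    if stored != current:
        raise RuntimeError(
            f"Embedding config changed from {stored} to {current}; "
            "re-index the collection before querying."
        )

# A matching config passes silently; a swapped model would raise.
check_embedding_config(
    {"embed_model": "nomic-embed-text", "dimensions": 768}, CURRENT
)
```

It is a few lines of code, and it turns a silent quality regression (vectors from two incompatible models mixed in one collection) into an immediate, explicit error.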
If the answers are not good, the problem is usually in the chunking. Chunks that are too big bring too much irrelevant text and the model gets confused. Chunks that are too small lose context and the answer is incomplete. Try smaller chunks with more overlap, or split by paragraph instead of by character count. This depends a lot on the type of documents you have.
The context window is also something to watch. You are passing retrieved chunks plus the question into the LLM, and if you include too many large chunks you can go over the limit. A safe approach is to retrieve 3 to 5 chunks and keep each one under 400 tokens. llama3.2 has an 8k token context window by default, which is enough if you are careful with the chunk size.
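A defensive way to apply that budget is to keep adding retrieved chunks, in relevance order, until the limit would be exceeded. The sketch below is illustrative (the function name is mine, and word count again stands in for a real token count):

```python
def fit_to_budget(chunks: list[str], max_tokens: int = 2000) -> list[str]:
    """Keep chunks (already sorted by relevance) within a rough token budget.

    Word count approximates tokens; use the model tokenizer for precision.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

# Three 900-word chunks against a 2000-word budget: only two fit.
chunks = [" ".join(["w"] * 900) for _ in range(3)]
print(len(fit_to_budget(chunks)))  # 2
```

Because the chunks arrive sorted by similarity, truncating from the tail drops the least relevant context first, which is usually the right trade-off.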
On the hardware side, a 7B model in 4-bit quantization needs around 5 to 6 GB of RAM. Adding Qdrant, which is very lightweight, and the application, the total is around 8 to 10 GB. On a 16 GB machine this is comfortable. If you have less RAM, a smaller model like phi3.5 at 3.8B parameters is a good alternative that still gives useful results.
What I Found in My Experiments
Qdrant was very simple to start with. Just run the Docker image and it works with no configuration needed. For persistent storage you only need to add a volume mount, and in a docker-compose.yml alongside the rest of the services it integrates cleanly without any special networking configuration:
```
docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
```
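The same container translates directly into a compose service. This is a minimal sketch: the `app` service, its build context, and the `QDRANT_URL` variable are illustrative stand-ins for the rest of the stack, not taken from the original project.

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./qdrant_storage:/qdrant/storage

  app:                                  # hypothetical application service
    build: .
    environment:
      - QDRANT_URL=http://qdrant:6333   # compose DNS resolves the service name
    depends_on:
      - qdrant
```

Inside the compose network the application reaches Qdrant by service name (`http://qdrant:6333`) instead of `localhost`, which is the only change needed compared to running it standalone.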
The embeddings from Ollama worked well from the first test. I did not need to tune anything. nomic-embed-text already gave useful retrieval results for domain-specific documents without any changes.
Chunk size made a real difference in quality. I tested with 256, 512, and 1024 tokens. With 1024 the results had too much irrelevant surrounding text that diluted the retrieval signal, and with 256 some answers were missing important context. 512 was the best balance for the type of documents I was working with.
The Qdrant dashboard at http://localhost:6333/dashboard was also more useful than I expected. When a retrieval is not working as expected, you can open it and see exactly what is stored and what is being returned for a query. It saves a lot of time compared to adding print statements to the code.
When to Use This Stack
This setup works well for internal knowledge bases, documentation search, or any project where documents cannot leave the company network. It is also good for simple Q&A over a set of documents, or for prototyping when you do not want to pay for API calls while you are still experimenting.
It is not the best choice when you need complex reasoning; smaller local models are not as capable as GPT-4-class models for that. If your document collection is very large, with millions of vectors, Qdrant supports distributed mode, but that is a different and more complex setup. And if your project needs support for many languages, it is worth checking the embedding model benchmarks carefully before choosing one, because quality varies a lot between models.
Conclusion
When I needed to add RAG to my project, I wanted something that runs local, works in Docker, and is not too complex to set up. Qdrant was the right choice for that. Together with Ollama, the stack is straightforward: Ollama handles the models for both embedding and generation, and Qdrant handles the storage and search.
It is not the most powerful setup you can build, and I know there are more complete frameworks available. But for requirements like mine, it works very well, the setup time is short, and the result is a RAG system with no external dependencies, no token costs, and no data leaving the infrastructure.
If you are thinking about adding local RAG to a project, this is a good place to start.
References
- Qdrant Documentation
- Qdrant GitHub (27,000+ stars as of 2025)
- Ollama Official Site
- Qdrant Python Client
- nomic-embed-text on Ollama
- mxbai-embed-large on Ollama
- Qdrant 2025 Recap: Powering the Agentic Era