From Vectors to Answers: Building a Local RAG Agent

Written on February 8, 2026 @ 21:00 by Bastiaan

Last updated on February 8, 2026 @ 21:00

Introduction

In my previous post, I showed how semantic search works by turning text into vectors and comparing those vectors to find meaningfully similar content: https://bastiaan.dev/blog/from-text-to-vectors

That pipeline gets you results.

But most of the time, you don’t want results.

You want an answer.

And you want that answer to be grounded in your knowledge base: docs, policies, manuals, product specs, runbooks, ticket history — whatever your users actually care about.

That’s where RAG (Retrieval‑Augmented Generation) comes in.

Where this is useful (real examples)

RAG + vector search is one of those techniques that scales from tiny side project to serious production system.

A few practical examples:

  • Customer support / helpdesk chatbots that answer questions using your internal knowledge base (refund policy, troubleshooting steps, warranty rules).
  • Internal assistants for engineers (searching runbooks, ADRs, onboarding docs, architecture docs).
  • Product search and recommendations that work even when people don’t type the “right” keywords.
  • Document Q&A (PDFs, meeting notes, release notes) where you want answers that cite the source text.
  • Compliance and policy assistants that reduce “confidently wrong” replies by retrieving the exact policy chunk first.

The common pattern: you want the model to answer with context, not just vibes.

A practical mental model for RAG

Retrieval-Augmented Generation (RAG) is a way to make AI language models smarter, more accurate, and more up-to-date by letting them look things up before they answer.

RAG is basically semantic search plus one extra step.

A typical RAG pipeline looks like this:

  1. Embed your documents into vectors.
  2. Store those vectors in a database.
  3. When the user asks a question: embed the question too.
  4. Retrieve the most similar vectors (top‑k).
  5. Generate an answer using the retrieved text as context.

So: search first, answer second.
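
Sketched in TypeScript, with embedText, searchTopK, and generateAnswer as hypothetical placeholder helpers (not the demo project's actual API), the loop looks roughly like this:

```ts
// Hypothetical helpers: "call the embedding model", "query the vector store",
// "call the chat model". Their real equivalents depend on your stack.
declare function embedText(text: string): Promise<number[]>;
declare function searchTopK(vector: number[], k: number): Promise<{ text: string }[]>;
declare function generateAnswer(prompt: string): Promise<string>;

async function answerQuestion(question: string): Promise<string> {
  // Steps 1–2 (embed + store the documents) happen ahead of time, offline.

  // Step 3: embed the question with the same model used for the documents.
  const queryVector = await embedText(question);

  // Step 4: retrieve the top-k most similar chunks from the vector store.
  const chunks = await searchTopK(queryVector, 5);

  // Step 5: generate an answer grounded in the retrieved text.
  const context = chunks.map((c) => c.text).join("\n---\n");
  return generateAnswer(
    `Answer using only this context:\n${context}\n\nQuestion: ${question}`
  );
}
```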

Why build RAG locally?

Cloud RAG stacks are great, but local RAG has some strong advantages:

  • Privacy: your knowledge base stays on your machine.
  • Speed: no network hops for embeddings or retrieval.
  • Simplicity: a single SQLite file is hard to beat.
  • Portability: you can ship the database with your app.

The sweet spot is when your data is sensitive, your latency budget is tight, or you want a minimal stack that still feels “real”.

The building blocks

For this post, we’ll look at a small demo project that ties the full loop together:

local-ollama-rag-agent https://github.com/basst85/local-ollama-rag-agent

It uses:

  • Ollama for running models locally (embeddings + chat).
  • EmbeddingGemma for embeddings (fast, on-device friendly).
  • Bun as the runtime (TypeScript + a great SQLite driver).
  • SQLite as the vector store.
  • sqlite-vector as the vector search extension.

Let’s quickly unpack the interesting parts.

Embeddings: why EmbeddingGemma is a nice fit

EmbeddingGemma is designed specifically for on-device retrieval use cases like RAG and semantic search.

The detail I like most is flexible output dimensions using Matryoshka Representation Learning (MRL). In practice that means you can trade off speed/storage vs quality by using a smaller slice of the embedding (for example 256 dims instead of 768).

  • Smaller vectors → less disk, less RAM, faster similarity search.
  • Still good retrieval performance for many practical workloads.
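
As a minimal sketch of that trade-off (assuming you already have the full 768-dimensional embedding as a number[]): take the first 256 dimensions and re-normalize, so cosine and dot-product scores stay well-behaved.

```ts
// MRL-style truncation: keep the first `dims` dimensions of a full embedding,
// then L2-normalize the shorter vector so similarity scores remain comparable.
function truncateEmbedding(full: number[], dims = 256): number[] {
  const sliced = full.slice(0, dims);
  const norm = Math.sqrt(sliced.reduce((sum, x) => sum + x * x, 0)) || 1;
  return sliced.map((x) => x / norm);
}
```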

More here: https://developers.googleblog.com/en/introducing-embeddinggemma/

And the technical report: https://arxiv.org/html/2509.20354v3

Vector search inside SQLite (yes, really)

If you’ve ever used SQLite, you know it’s a Swiss army knife.

The missing piece for modern AI apps is: vector search.

That’s what sqlite-vector provides: https://github.com/sqliteai/sqlite-vector

It’s a SQLite extension that lets you store embeddings as BLOBs and run similarity search (cosine / dot / L2 depending on configuration).

This is a big deal because it means you can keep the entire knowledge base in a single file:

  • no separate vector database server
  • no extra infrastructure
  • no “sidecar” process to run
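
To make “embeddings as BLOBs” concrete: an embedding is just an array of floats, so it can be packed into the raw bytes of a Float32Array before it goes into a BLOB column. The table layout and the actual similarity-search SQL come from sqlite-vector's own documentation; the sketch below only shows the packing and unpacking.

```ts
// Pack an embedding (number[]) into raw Float32 bytes for a SQLite BLOB column.
function toBlob(embedding: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(embedding).buffer);
}

// Unpack a BLOB read back from SQLite into a plain number[].
function fromBlob(blob: Uint8Array): number[] {
  return Array.from(
    new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4)
  );
}
```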

Bun + SQLite: a clean runtime for this stack

Bun ships with a SQLite driver (bun:sqlite) and supports loading extensions.

Docs: https://bun.com/docs/runtime/sqlite

That makes it a great fit for “local-first” AI tools: TypeScript + SQLite + extensions + HTTP calls to Ollama.
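
A small sketch of that combination, assuming the sqlite-vector extension binary lives at ./vector and using a table layout I made up for illustration:

```ts
import { Database } from "bun:sqlite";

// Open (or create) the single-file knowledge base.
const db = new Database("knowledge.sqlite");

// Load the sqlite-vector extension. The filename depends on your platform build,
// and on macOS Bun's docs note you may need Database.setCustomSQLite() first.
db.loadExtension("./vector");

// A plain table: chunk text next to its embedding stored as a BLOB.
db.run(`
  CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    embedding BLOB NOT NULL
  )
`);

// Insert one chunk with a (placeholder) packed embedding.
const embedding = new Float32Array([0.12, -0.03, 0.55]);
db.prepare("INSERT INTO chunks (text, embedding) VALUES (?, ?)").run(
  "Refunds are accepted within 30 days of purchase.",
  new Uint8Array(embedding.buffer)
);
```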

How the demo agent works (step-by-step)

The local-ollama-rag-agent README summarizes the flow in a clean way:

  1. Embed: documents → vectors (EmbeddingGemma via Ollama)
  2. Store: vectors in SQLite (sqlite-vector)
  3. Search: vector similarity search returns top‑k chunks
  4. Answer: an LLM answers using the retrieved context
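
To make the embed and answer steps concrete, here is a hedged sketch of the two Ollama HTTP calls involved. The model names are assumptions (use whatever you have pulled locally), but /api/embed and /api/chat are Ollama's standard endpoints.

```ts
const OLLAMA = "http://localhost:11434";

// Embed: turn text into a vector with the embedding model.
async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embed`, {
    method: "POST",
    body: JSON.stringify({ model: "embeddinggemma", input: text }),
  });
  const data = await res.json();
  return data.embeddings[0]; // /api/embed returns one embedding per input
}

// Answer: ask the chat model to respond using the retrieved chunks as context.
async function answerWithContext(question: string, context: string): Promise<string> {
  const res = await fetch(`${OLLAMA}/api/chat`, {
    method: "POST",
    body: JSON.stringify({
      model: "llama3.1",
      stream: false,
      messages: [
        { role: "system", content: `Answer using only this context:\n${context}` },
        { role: "user", content: question },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}
```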

One more detail makes it feel like an “agent” instead of a script:

Tool-calling: retrieve only when needed

Not every question needs retrieval.

If someone asks “2 + 2”, retrieval adds noise.

So the project exposes a search_database tool and lets the chat model decide when to call it. This keeps answers clean and can reduce latency.
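
A rough sketch of that decision point using Ollama's tool-calling support; the search_database schema below is my guess at its shape, not the project's exact definition:

```ts
// Describe the retrieval tool so the chat model can decide whether to call it.
const tools = [
  {
    type: "function",
    function: {
      name: "search_database",
      description: "Semantic search over the local knowledge base",
      parameters: {
        type: "object",
        properties: { query: { type: "string", description: "Search query" } },
        required: ["query"],
      },
    },
  },
];

const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({
    model: "llama3.1",
    stream: false,
    messages: [{ role: "user", content: "What is our refund policy?" }],
    tools,
  }),
});
const { message } = await res.json();

if (message.tool_calls?.length) {
  // The model asked for retrieval: run the vector search, append the results
  // as a "tool" message, and call /api/chat again so it can answer with context.
} else {
  // A question like "2 + 2": the model answered directly, no retrieval needed.
  console.log(message.content);
}
```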

Figure: flowchart of the RAG pipeline (embed → store → search → answer).

Conclusion

Semantic search helps you find the right text.

RAG turns that text into answers. And if you build it locally, you also get privacy and simplicity for free.

References and further reading

  • From Text to Vectors: A Practical Guide to Semantic Search: https://bastiaan.dev/blog/from-text-to-vectors
  • local-ollama-rag-agent (GitHub): https://github.com/basst85/local-ollama-rag-agent
  • Introducing EmbeddingGemma (Google Developers Blog): https://developers.googleblog.com/en/introducing-embeddinggemma/
  • EmbeddingGemma technical report (arXiv): https://arxiv.org/html/2509.20354v3
  • sqlite-vector (GitHub): https://github.com/sqliteai/sqlite-vector
  • Bun SQLite runtime docs: https://bun.com/docs/runtime/sqlite