FAQ Chatbot: RAG & Embeddings Pipeline

Overview
This project is a deliberately small RAG system built to explore how LLMs can produce reliable responses without hallucinating.
The chatbot provides trustworthy answers by grounding responses in a local document knowledge base. It parses Markdown documents, splits them into structured chunks, generates embeddings, and stores them in ChromaDB for semantic retrieval. Queries are matched against the vector store, and the retrieved context is passed to a local Ollama LLM to generate answers.
Each response includes citations to the source documents, with heading and line ranges, so the accuracy of answers can be verified against the knowledge base.
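A citation of this shape can be represented as a small structure. This is only a sketch: the field names (`document`, `heading`, `start_line`, `end_line`) are assumptions, not the project's actual response schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class Citation:
    # Hypothetical fields; the project's real schema may differ.
    document: str    # source Markdown file
    heading: str     # nearest heading above the cited chunk
    start_line: int  # first line of the cited chunk
    end_line: int    # last line of the cited chunk

# An answer payload bundles the generated text with its citations.
answer = {
    "text": "Answers are grounded in the retrieved chunks.",
    "citations": [asdict(Citation("faq.md", "Overview", 3, 12))],
}
print(answer["citations"][0]["document"])  # prints "faq.md"
```

Keeping citations structured (rather than embedded in the answer text) lets the API return them as machine-readable JSON alongside the answer.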
What it does
- Document Ingestion – Markdown files are parsed and split into chunks, then hashed to track updates. Only new or modified chunks are processed.
- Embedding & Retrieval – Chunks are embedded using a local Ollama model and stored in ChromaDB. Queries are embedded and matched against stored chunks with a similarity threshold.
- Answer Generation – Retrieved chunks are passed to a local LLM, which produces answers strictly grounded in the documents.
- API & Debugging – FastAPI handles queries and returns structured responses with citations. Debug endpoints provide insight into retrieved chunks and similarity scores.
This design allows each component to focus on a specific responsibility while keeping the deployment simple and containerised.
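The ingestion step above (split, hash, skip unchanged chunks) can be sketched roughly as follows. The heading-based splitter and the set-of-hashes storage are assumptions about the implementation, not a description of the project's actual code.

```python
import hashlib

def split_by_heading(markdown: str) -> list[str]:
    """Naive chunker: start a new chunk at every heading line.
    (The real pipeline likely also records headings and line ranges.)"""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def changed_chunks(chunks: list[str], known_hashes: set[str]) -> list[str]:
    """Return only chunks whose content hash is not already stored,
    so embeddings are regenerated only for new or modified content."""
    return [c for c in chunks
            if hashlib.sha256(c.encode()).hexdigest() not in known_hashes]

doc = "# Intro\nhello\n# Usage\nrun it"
chunks = split_by_heading(doc)
# Pretend the first chunk was ingested on a previous run.
seen = {hashlib.sha256(chunks[0].encode()).hexdigest()}
print(len(changed_chunks(chunks, seen)))  # prints "1"
```

Hashing chunk content (rather than whole files) means editing one section of a document only re-embeds that section.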
Key Features
- Grounded Answers: Responses are based solely on retrieved documents.
- Citation-Based: Each answer references the source document, heading, and line range.
- Persistent Vector Store: ChromaDB ensures fast and efficient semantic search.
- Incremental Updates: Only new or modified chunks are processed.
- Similarity Filtering: Low-confidence matches are excluded for reliability.
- Debug & Inspectability: View retrieved chunks and similarity scores.
- Single-Container Deployment: Simple setup while maintaining modular internal structure.
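The similarity filtering listed above could look something like this sketch, using cosine similarity and a made-up threshold value; the project's actual distance metric and cutoff are not specified here and may differ.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_matches(query_vec, candidates, threshold=0.35):
    """Keep only (chunk, score) pairs at or above the threshold, best first.
    `candidates` maps chunk text to its embedding; threshold is illustrative."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in candidates.items()]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)

candidates = {"relevant": [1.0, 0.0], "off-topic": [0.0, 1.0]}
matches = filter_matches([0.9, 0.1], candidates)
print([t for t, _ in matches])  # prints "['relevant']"
```

Dropping low-confidence matches before generation is what keeps the LLM from being handed irrelevant context it might confabulate around.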
Technical Highlights
- Containerised system built with Docker for deployment.
- Local embeddings and LLM inference with Ollama models (all-minilm:22m for embeddings, qwen2.5:0.5b-instruct for answers).
- Persistent ChromaDB vector store for semantic retrieval.
- Modular design enforces clear separation of responsibilities and decoupled services.
- Citation-based answer generation reduces hallucinations and ensures traceability.
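Citation-based grounding largely comes down to how the prompt is assembled before the Ollama call. A minimal sketch follows; the instruction wording and the chunk dictionary keys (`text`, `doc`, `lines`) are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the LLM to the retrieved context.
    Each chunk dict is assumed to carry `text`, `doc`, and `lines` keys."""
    context = "\n\n".join(
        f"[{c['doc']}:{c['lines']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer using ONLY the context below. "
        "Cite sources as [file:lines]. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "How are updates tracked?",
    [{"doc": "faq.md", "lines": "8-9",
      "text": "Chunks are hashed to detect changes."}],
)
print("[faq.md:8-9]" in prompt)  # prints "True"
```

The resulting string would then be sent to the local model (qwen2.5:0.5b-instruct in this project); embedding source labels directly in the context is one simple way to let the model cite what it used.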
Tech Stack
Python, FastAPI, Docker, ChromaDB, Ollama, Markdown, RAG, Qwen2.5 0.5B, all-minilm:22m
Possible Extensions
- Multi-user support and authentication
- Support for additional document formats (PDF, HTML)
- Multi-turn conversational queries
- Real-time document ingestion and monitoring