FAQ Chatbot: RAG & Embeddings Pipeline

Overview
This project is a deliberately small RAG system built to explore how LLMs can produce reliable responses without hallucinating.
The chatbot provides trustworthy answers by grounding responses in a local document knowledge base. It parses Markdown documents, splits them into structured chunks, generates embeddings, and stores them in ChromaDB for semantic retrieval. Queries are matched against the vector store, and the retrieved context is passed to a local Ollama LLM to generate answers.
Each response includes citations to the source documents, with heading and line ranges, so the accuracy of answers can be verified against the knowledge base.
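A citation of this shape can be represented as a small structure. This is only a sketch: the field names (`document`, `heading`, `start_line`, `end_line`) are assumptions, not the project's actual response schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class Citation:
    # Hypothetical fields; the project's real schema may differ.
    document: str    # source Markdown file
    heading: str     # nearest heading above the cited chunk
    start_line: int  # first line of the cited chunk
    end_line: int    # last line of the cited chunk

# An answer payload bundles the generated text with its citations.
answer = {
    "text": "Answers are grounded in the retrieved chunks.",
    "citations": [asdict(Citation("faq.md", "Overview", 3, 12))],
}
print(answer["citations"][0]["document"])  # prints "faq.md"
```

Keeping citations structured (rather than embedded in the answer text) lets the API return them as machine-readable JSON alongside the answer.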
What it does
- Document Ingestion – Markdown files are parsed and split into chunks, then hashed to track updates. Only new or modified chunks are processed.
- Embedding & Retrieval – Chunks are embedded using a local Ollama model and stored in ChromaDB. Queries are embedded and matched against stored chunks with a similarity threshold.
- Answer Generation – Retrieved chunks are passed to a local LLM, which produces answers strictly grounded in the documents.
- API & Debugging – FastAPI handles queries and returns structured responses with citations. Debug endpoints provide insight into retrieved chunks and similarity scores.
This design allows each component to focus on a specific responsibility while keeping the deployment simple and containerised.
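The ingestion step above (split, hash, skip unchanged chunks) can be sketched roughly as follows. The heading-based splitter and the set-of-hashes storage are assumptions about the implementation, not a description of the project's actual code.

```python
import hashlib

def split_by_heading(markdown: str) -> list[str]:
    """Naive chunker: start a new chunk at every heading line.
    (The real pipeline likely also records headings and line ranges.)"""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def changed_chunks(chunks: list[str], known_hashes: set[str]) -> list[str]:
    """Return only chunks whose content hash is not already stored,
    so embeddings are regenerated only for new or modified content."""
    return [c for c in chunks
            if hashlib.sha256(c.encode()).hexdigest() not in known_hashes]

doc = "# Intro\nhello\n# Usage\nrun it"
chunks = split_by_heading(doc)
# Pretend the first chunk was ingested on a previous run.
seen = {hashlib.sha256(chunks[0].encode()).hexdigest()}
print(len(changed_chunks(chunks, seen)))  # prints "1"
```

Hashing chunk content (rather than whole files) means editing one section of a document only re-embeds that section.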
Key Features
- Grounded Answers: Responses are based solely on retrieved documents.
- Citation-Based: Each answer references the source document, heading, and line range.
- Persistent Vector Store: ChromaDB ensures fast and efficient semantic search.
- Incremental Updates: Only new or modified chunks are processed.
- Similarity Filtering: Low-confidence matches are excluded for reliability.
- Debug & Inspectability: View retrieved chunks and similarity scores.
- Single-Container Deployment: Simple setup while maintaining modular internal structure.
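The similarity filtering listed above could look something like this sketch, using cosine similarity and a made-up threshold value; the project's actual distance metric and cutoff are not specified here and may differ.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_matches(query_vec, candidates, threshold=0.35):
    """Keep only (chunk, score) pairs at or above the threshold, best first.
    `candidates` maps chunk text to its embedding; threshold is illustrative."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in candidates.items()]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)

candidates = {"relevant": [1.0, 0.0], "off-topic": [0.0, 1.0]}
matches = filter_matches([0.9, 0.1], candidates)
print([t for t, _ in matches])  # prints "['relevant']"
```

Dropping low-confidence matches before generation is what keeps the LLM from being handed irrelevant context it might confabulate around.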
Technical Highlights
- Containerised system built with Docker for deployment.
- Local embeddings and LLM inference with Ollama models (all-minilm:22m for embeddings, qwen2.5:0.5b-instruct for answers).
- Persistent ChromaDB vector store for semantic retrieval.
- Modular design enforces clear separation of responsibilities and decoupled services.
- Citation-based answer generation reduces hallucinations and ensures traceability.
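Citation-based grounding largely comes down to how the prompt is assembled before the Ollama call. A minimal sketch follows; the instruction wording and the chunk dictionary keys (`text`, `doc`, `lines`) are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the LLM to the retrieved context.
    Each chunk dict is assumed to carry `text`, `doc`, and `lines` keys."""
    context = "\n\n".join(
        f"[{c['doc']}:{c['lines']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer using ONLY the context below. "
        "Cite sources as [file:lines]. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "How are updates tracked?",
    [{"doc": "faq.md", "lines": "8-9",
      "text": "Chunks are hashed to detect changes."}],
)
print("[faq.md:8-9]" in prompt)  # prints "True"
```

The resulting string would then be sent to the local model (qwen2.5:0.5b-instruct in this project); embedding source labels directly in the context is one simple way to let the model cite what it used.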
Tech Stack
Python, FastAPI, Docker, ChromaDB, Ollama, Markdown, RAG, Qwen2.5 0.5B, all-minilm:22m
Possible Extensions
- Multi-user support and authentication
- Support for additional document formats (PDF, HTML)
- Multi-turn conversational queries
- Real-time document ingestion and monitoring