Menu-to-Insight: OCR & RAG Pipeline

Overview
This project was built to gain experience deploying a multi-service, containerised application with Docker and grounding a local LLM in a knowledge base.
The system uses a RAG pipeline to ingest images via OCR -> retrieve context through Qdrant vector search -> generate food recommendations using a local Ollama model (Qwen-2.5).
A limitation I encountered is that for each query, the LLM has to repeatedly retrieve raw data and rediscover knowledge. To improve this, I plan to implement a wiki layer to accumulate the LLM's insights, summaries and contradictions to provide more contextual answers.
What it does
Image Processing (OCR):
- Ingests images of restaurant menus
- Preprocesses and enhances images to improve text clarity
- Detects text regions using PaddleOCR
- Extracts text from each region
- Reconstructs the extracted text into structured menu items
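As a rough illustration of the reconstruction step, the sketch below groups detected text regions into menu lines by vertical position and reads each line left to right. The `TextRegion` type, the `group_into_lines` helper and the `tolerance` value are hypothetical names chosen for this example; they are not taken from the project or from PaddleOCR's API.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    """Hypothetical, simplified OCR output for one detected region."""
    text: str
    x: float       # left edge of the bounding box
    y: float       # vertical centre of the bounding box
    height: float  # box height, used to scale the line tolerance


def group_into_lines(regions, tolerance=0.5):
    """Group regions with similar vertical centres, then order each line by x."""
    lines = []
    for region in sorted(regions, key=lambda r: r.y):
        # Same line if the centre is within a fraction of the box height
        # of the previous region's centre (tolerance is an assumed value).
        if lines and abs(region.y - lines[-1][-1].y) <= tolerance * region.height:
            lines[-1].append(region)
        else:
            lines.append([region])
    return [" ".join(r.text for r in sorted(line, key=lambda r: r.x)) for line in lines]


# Toy regions in arbitrary detection order.
regions = [
    TextRegion("12.50", x=300, y=101, height=20),
    TextRegion("Chicken Salad", x=10, y=100, height=20),
    TextRegion("Beef Burger", x=10, y=140, height=20),
    TextRegion("15.00", x=300, y=142, height=20),
]
menu_lines = group_into_lines(regions)
print(menu_lines)
```

A real menu with multiple columns or skewed photos would likely need more than a single vertical tolerance, so this is only a starting point.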
Analysis & Generation (RAG):
- Converts menu items into embeddings
- Semantically searches a vector database of verified foods
- Infers nutritional information from the most relevant matches
- Applies similarity thresholds to enforce high-confidence retrieval
- Generates personalised recommendations with reasoning constrained to the retrieved context
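The thresholding and context-constraint ideas above can be sketched without any vector database. This example assumes cosine similarity as the score and uses toy two-dimensional vectors; the `retrieve` and `build_prompt` helpers, the `0.8` threshold and the nutritional values are all made up for illustration and do not reflect the project's actual code or data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, foods, threshold=0.8, top_k=3):
    """Return up to top_k (score, food) pairs whose score meets the threshold."""
    scored = [(cosine_similarity(query_vec, f["vector"]), f) for f in foods]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


def build_prompt(menu_item, matches):
    """Build a prompt restricted to retrieved context, or None if nothing matched."""
    if not matches:
        return None  # no confident match: better to skip than to guess
    context = "\n".join(
        f"- {f['name']}: {f['kcal']} kcal, {f['protein_g']} g protein"
        for _, f in matches
    )
    return (
        "Answer using ONLY the context below. If it is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Menu item: {menu_item}"
    )


# Toy knowledge base with invented values.
foods = [
    {"name": "grilled chicken", "vector": [1.0, 0.0], "kcal": 100, "protein_g": 20},
    {"name": "chocolate cake", "vector": [0.0, 1.0], "kcal": 400, "protein_g": 5},
]
matches = retrieve([0.9, 0.1], foods)
prompt = build_prompt("Chargrilled chicken", matches)
print(prompt)
```

Returning `None` when no match clears the threshold keeps low-confidence items out of the prompt entirely, which is one simple way to enforce high-confidence retrieval.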
Food Logging:
- Allows users to search and log foods with nutritional values
- Provides an encouraging message after each log, reflecting on the meal and suggesting a small improvement
- Dynamically tracks progress against calorie and protein targets
- Allows users to edit entries and personalise their profile
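Progress tracking of this kind reduces to summing logged entries and comparing them with the user's targets. The following is a minimal sketch with hypothetical names (`LogEntry`, `progress`) and invented sample values, assuming both targets are positive numbers.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """Hypothetical shape of one logged food."""
    name: str
    kcal: float
    protein_g: float


def progress(entries, kcal_target, protein_target):
    """Summarise totals, remaining amounts and percentage of each target."""
    total_kcal = sum(e.kcal for e in entries)
    total_protein = sum(e.protein_g for e in entries)
    return {
        "kcal": total_kcal,
        "protein_g": total_protein,
        # Clamp at zero so exceeding a target does not show a negative remainder.
        "kcal_remaining": max(kcal_target - total_kcal, 0),
        "protein_remaining": max(protein_target - total_protein, 0),
        "kcal_pct": round(100 * total_kcal / kcal_target, 1),
        "protein_pct": round(100 * total_protein / protein_target, 1),
    }


entries = [LogEntry("oats", 300, 10), LogEntry("chicken wrap", 500, 35)]
summary = progress(entries, kcal_target=2000, protein_target=100)
print(summary)
```

Because the summary is recomputed from the entries each time, editing or deleting a log entry automatically updates the progress figures.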
Architecture
The system is built as a set of containerised services, each with a separate responsibility:
- API – Accepts images, orchestrates services and sends context to the LLM.
- Vision – Preprocesses images and extracts text using PaddleOCR and OpenCV.
- Vector – Generates embeddings and performs vector search using Qdrant.
- Database – Stores nutritional data and food logs using SQLite and SQLAlchemy.
- Local LLM – An Ollama-hosted model that generates recommendations grounded in retrieved context.
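To illustrate how the API service might hand a grounded prompt to the local model, here is a sketch that calls Ollama's `/api/generate` REST endpoint using only the standard library. The URL assumes Ollama's default port on `localhost`; inside Docker Compose the host would instead be whatever the Ollama service is named. The `qwen2.5` model tag and both helper names are assumptions for this example.

```python
import json
import urllib.request

# Assumption: Ollama listening on its default port on the same host.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="qwen2.5"):
    """Build a non-streaming generate request for Ollama."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="qwen2.5", timeout=60):
    """Send the prompt and return the generated text (requires a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


request = build_request("Recommend one dish using only the provided context.")
print(request.full_url, request.get_method())
```

Keeping the model name as a parameter is one way to make the model swappable without touching the calling code.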
Each service follows a consistent internal structure:
- Routes expose APIs and handle request/response flow
- Services encapsulate business logic, keeping routes focused on orchestration
- Schemas enforce data structure and ensure consistency across services
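The route/service/schema split can be shown in a framework-free sketch. The project uses FastAPI, but this example deliberately sticks to standard-library dataclasses so it runs anywhere; every name here (`MenuItemIn`, `MenuItemOut`, `normalise_item`, `handle_normalise`) is hypothetical.

```python
from dataclasses import asdict, dataclass


# Schema layer: defines and validates the data shape.
@dataclass
class MenuItemIn:
    name: str

    def __post_init__(self):
        if not isinstance(self.name, str) or not self.name.strip():
            raise ValueError("name must be a non-empty string")


@dataclass
class MenuItemOut:
    name: str
    normalised: str


# Service layer: business logic only, no request handling.
def normalise_item(item: MenuItemIn) -> MenuItemOut:
    """Lowercase the name and collapse repeated whitespace."""
    return MenuItemOut(name=item.name, normalised=" ".join(item.name.lower().split()))


# Route layer: parses input, calls the service, shapes the response.
def handle_normalise(payload: dict) -> tuple[int, dict]:
    try:
        item = MenuItemIn(**payload)
    except (TypeError, ValueError) as exc:
        return 422, {"error": str(exc)}
    return 200, asdict(normalise_item(item))


print(handle_normalise({"name": "  Chicken   SALAD "}))
print(handle_normalise({"name": ""}))
```

With this separation the service function can be unit-tested without any HTTP layer, and the route stays a thin adapter.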
Each service is observable through structured logging and debugging endpoints. I focused on keeping services modular to support scalability and maintainability, so that components, Docker images and LLMs can be swapped easily without compromising the rest of the system.
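One common way to get structured logs in Python is a JSON formatter on top of the standard `logging` module. The sketch below is an assumption about how that could look, not the project's actual logging setup; `JsonFormatter`, `get_logger` and the `fields` key are illustrative names.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def get_logger(service_name):
    """Return a logger that writes JSON lines to stderr, configured once."""
    logger = logging.getLogger(service_name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = get_logger("vision")
log.info("ocr complete", extra={"fields": {"regions": 12, "elapsed_ms": 340}})
```

Emitting one JSON object per line makes it straightforward to filter logs by service or field when several containers write to the same output.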
Demonstrates
- Deploying containerised services orchestrated with Docker Compose.
- Detecting, segmenting and extracting text and layout using OCR.
- Using RAG to ground the LLM in retrieved context, reducing hallucinations.
- Building a full-stack app with a clean, modular frontend and backend.
- Implementing structured logging and debugging endpoints across services.
- Hosting and configuring local large language models.
Tech Stack
Python, FastAPI, Docker, Ollama, Qdrant, PaddleOCR, OpenCV, SQLAlchemy, Pandas
Possible Extensions
- Long-term memory via a wiki layer that builds on the LLM's insights and context.
- Conversational queries using previous messages and retrieved context to support follow-up questions.
- Multi-user authentication.
- Long-term nutritional analytics and progress tracking.