RAG System for Technical Documentation — Dominique Legault

Context

Engaged through Earley Information Science to build a retrieval-augmented generation (RAG) system for a major HVAC manufacturer. The client's field technicians needed to quickly find answers across thousands of complex technical PDFs — installation guides, service manuals, parts catalogs, and engineering specifications.

Problem

The manufacturer's documentation library contained over 10,000 PDFs spanning decades of product lines. These documents were dense with tables, diagrams, wiring schematics, and technical specifications that existing search tools couldn't meaningfully parse. Technicians were spending 15-30 minutes per service call just locating the right information.

Approach

Designed an ontology-based retrieval architecture that mapped the manufacturer's product taxonomy, component hierarchy, and technical vocabulary into a structured knowledge representation
Built a PDF vision parsing pipeline that goes beyond OCR text extraction to capture the structure and content of tables, diagrams, and figures
Implemented faceted filtering powered by the ontology, allowing technicians to narrow results by product line, document type, component, and issue category
Developed an LLM-as-Judge evaluation framework to measure retrieval accuracy, answer quality, and faithfulness to source documents
Deployed on Azure with enterprise security controls, SSO integration, and audit logging
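
As an illustration of the faceted filtering step, the sketch below turns a technician's facet selections into a filter string. Azure AI Search accepts OData-style filter expressions, so the output follows that syntax. The field names (product_line, document_type, component, issue_category) and the function name are hypothetical stand-ins, not the client's actual index schema, and this is a minimal sketch rather than the production implementation.

```python
from typing import Mapping, Sequence

# Hypothetical facet field names; a real index schema would define its own.
FACET_FIELDS = ("product_line", "document_type", "component", "issue_category")


def _quote(value: str) -> str:
    """Quote a string literal for OData, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_facet_filter(selected: Mapping[str, Sequence[str]]) -> str:
    """Combine facet selections: OR within a facet, AND across facets.

    Keys that are not in FACET_FIELDS are ignored, and facets with no
    selected values contribute nothing to the filter.
    """
    clauses = []
    for field in FACET_FIELDS:
        values = selected.get(field, ())
        if not values:
            continue
        options = " or ".join(f"{field} eq {_quote(v)}" for v in values)
        clauses.append(f"({options})")
    return " and ".join(clauses)


if __name__ == "__main__":
    print(build_facet_filter({
        "product_line": ["Rooftop Units"],
        "document_type": ["Service Manual", "Installation Guide"],
    }))
```

The resulting string would typically be passed alongside the vector or keyword query so that ranking only runs over documents matching the selected facets.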

Key Technologies

Azure OpenAI Service (GPT-4, embeddings)
Azure AI Search
Custom PDF vision parsing pipeline
Ontology-driven faceted filtering
LLM-as-Judge evaluation framework (Precision@K, MRR, faithfulness scoring)
Karamk API for structured data extraction
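
For reference, the two ranking metrics named above can be computed as in the following sketch. It assumes the judge's output has already been reduced to a set of relevant document IDs per query; the function names and data shapes are illustrative, not taken from the project's codebase.

```python
from typing import Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant.

    Divides by k even when fewer than k results were retrieved,
    which is the usual convention.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Faithfulness scoring is not shown because it depends on the judge prompt and model, which this write-up does not specify.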

Results

Significantly reduced information lookup time for field technicians, which had been running 15-30 minutes per service call
Ontology-based faceted filtering substantially improved retrieval precision over baseline vector search
PDF vision parsing resolved the top accuracy blocker: tables and diagrams that vector search couldn't meaningfully index
LLM-as-Judge evaluation framework provided ongoing measurement of retrieval quality and answer faithfulness

Lessons Learned

The single biggest insight from this engagement was that the quality of your retrieval depends more on how you represent your documents than on which embedding model you use. Ontology-driven chunking and faceted filtering outperformed every pure vector search configuration we tested. This experience directly led to the creation of PDFsSuck — a tool to enhance PDF accessibility metadata so AI systems can understand visual content without runtime vision processing.