Major HVAC Manufacturer · ~12 months

RAG System for Technical Documentation

Built an enterprise RAG system to make thousands of complex technical PDFs searchable and queryable by field technicians.

via Earley Information Science

RAG · PDF Vision Parsing · Ontology-Based Faceted Filtering · LLM-as-Judge Evals · Azure

Context

Engaged through Earley Information Science to build a retrieval-augmented generation (RAG) system for a major HVAC manufacturer. The client's field technicians needed to quickly find answers across thousands of complex technical PDFs — installation guides, service manuals, parts catalogs, and engineering specifications.

Problem

The manufacturer's documentation library contained over 10,000 PDFs spanning decades of product lines. These documents were dense with tables, diagrams, wiring schematics, and technical specifications that existing search tools couldn't meaningfully parse. Technicians were spending 15-30 minutes per service call just locating the right information.

Approach

  • Designed an ontology-based retrieval architecture that mapped the manufacturer's product taxonomy, component hierarchy, and technical vocabulary into a structured knowledge representation
  • Built a PDF vision parsing pipeline to extract meaning from tables, diagrams, and figures — not just OCR text extraction, but genuine understanding of visual content
  • Implemented faceted filtering powered by the ontology, allowing technicians to narrow results by product line, document type, component, and issue category
  • Developed an LLM-as-Judge evaluation framework to measure retrieval accuracy, answer quality, and faithfulness to source documents
  • Deployed on Azure with enterprise security controls, SSO integration, and audit logging
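
To make the faceted filtering idea concrete, here is a minimal sketch in plain Python. It is not the production implementation, which ran on Azure AI Search. The `Chunk` class, the facet names and values, and the toy two-dimensional embeddings are all assumptions made for illustration. The point it shows is the order of operations: narrow the candidate set by ontology facets first, then rank what remains by vector similarity.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Chunk:
    """A document chunk with an embedding and ontology-derived facets (hypothetical shape)."""
    text: str
    embedding: list[float]
    facets: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def faceted_search(chunks, query_embedding, filters, k=3):
    """Keep only chunks matching every facet filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.facets.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data: facet names and values are illustrative, not the client's taxonomy.
chunks = [
    Chunk("Compressor wiring diagram", [0.9, 0.1],
          {"product_line": "rooftop", "document_type": "service_manual"}),
    Chunk("Compressor part numbers", [0.8, 0.2],
          {"product_line": "rooftop", "document_type": "parts_catalog"}),
    Chunk("Furnace igniter replacement", [0.85, 0.15],
          {"product_line": "furnace", "document_type": "service_manual"}),
]

results = faceted_search(
    chunks,
    query_embedding=[1.0, 0.0],
    filters={"product_line": "rooftop", "document_type": "service_manual"},
)
broader = faceted_search(chunks, [1.0, 0.0], {"product_line": "rooftop"})
```

Without the filters, the furnace chunk would rank second on similarity alone even though it belongs to the wrong product line. Filtering first removes that class of error before ranking happens.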

Key Technologies

  • Azure OpenAI Service (GPT-4, embeddings)
  • Azure AI Search
  • Custom PDF vision parsing pipeline
  • Ontology-driven faceted filtering
  • LLM-as-Judge evaluation framework (Precision@K, MRR, faithfulness scoring)
  • Karamk API for structured data extraction
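
For readers unfamiliar with the retrieval metrics named above, the following sketch shows the standard definitions of Precision@K and MRR. The function names and the sample document IDs are assumptions for illustration, not the project's evaluation code. Faithfulness scoring is left out because it depends on a judge model rather than a closed-form formula.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with made-up document IDs.
runs = [
    (["a", "b", "c"], {"b"}),        # first relevant hit at rank 2
    (["d", "e", "f"], {"d", "f"}),   # first relevant hit at rank 1
]
p_at_2 = precision_at_k(["a", "b", "c"], {"b"}, k=2)
mrr = mean_reciprocal_rank(runs)
```

Precision@K rewards putting many relevant results near the top, while MRR only cares how early the first relevant one appears. The two can disagree, which is a reason to track both.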

Results

  • Significantly reduced information lookup time for field technicians
  • Ontology-based faceted filtering substantially improved retrieval precision over baseline vector search
  • PDF vision parsing resolved the #1 accuracy blocker — tables and diagrams that vector search couldn't meaningfully index
  • LLM-as-Judge evaluation framework provided ongoing measurement of retrieval quality and answer faithfulness

Lessons Learned

The single biggest insight from this engagement was that the quality of your retrieval depends more on how you represent your documents than on which embedding model you use. Ontology-driven chunking and faceted filtering outperformed every pure vector search configuration we tested. This experience directly led to the creation of PDFsSuck — a tool to enhance PDF accessibility metadata so AI systems can understand visual content without runtime vision processing.
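
As a rough illustration of what ontology-driven chunking can look like, the sketch below splits a document on blank lines and tags each chunk with the ontology concepts it mentions. Everything here is an assumption: the tiny `ONTOLOGY` dictionary, the facet names, the synonym lists, and the blank-line splitting rule are stand-ins, and the actual pipeline was considerably more involved.

```python
import re

# Hypothetical mini-ontology: facet -> concept -> surface terms seen in documents.
ONTOLOGY = {
    "component": {
        "compressor": ["compressor", "scroll compressor"],
        "igniter": ["igniter", "hot surface igniter"],
    },
    "issue_category": {
        "electrical": ["wiring", "voltage"],
        "airflow": ["airflow", "static pressure"],
    },
}


def tag_chunk(text: str) -> dict[str, list[str]]:
    """Return the ontology concepts mentioned in the text, grouped by facet."""
    lowered = text.lower()
    facets = {}
    for facet, concepts in ONTOLOGY.items():
        matched = sorted(
            concept
            for concept, terms in concepts.items()
            if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)
        )
        if matched:
            facets[facet] = matched
    return facets


def chunk_document(document: str) -> list[dict]:
    """Split on blank lines and attach ontology facets to each chunk."""
    sections = [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]
    return [{"text": s, "facets": tag_chunk(s)} for s in sections]


doc = (
    "Check the compressor wiring for loose terminals.\n\n"
    "Measure static pressure across the coil."
)
tagged = chunk_document(doc)
```

Each chunk carries its facets as lists because a single passage can mention several components or issue types. Those facets are what a filter can act on at query time, independent of which embedding model produced the vectors.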