RAG in Production: From PoC to Enterprise System



The leap from a RAG prototype to a production system is bigger than many expect. In this article, I share my experiences from building a RAG system for a pharmaceutical publishing house -- with over 135 technical books and approximately 2,000 journal articles as the knowledge base.

The Starting Situation

A publishing house in the pharmaceutical sector faced a classic challenge: decades of valuable expertise, spread across various publication formats, needed to be made accessible through an AI-powered search.

The Requirements:

  • 135 technical books (primarily PDF)
  • ~2,000 journal articles (XML, JSON, HTML)
  • Many complex tables with pharmaceutical data
  • Target: 5,000-10,000 queries per day
  • Response time under 15 seconds

What started as a manageable project turned into a lesson in enterprise RAG architecture.

Architecture Decisions

The Tech Stack

After evaluating various options, we chose:

| Component      | Technology               | Rationale                                  |
|----------------|--------------------------|--------------------------------------------|
| Vector Store   | Elasticsearch            | Existing expertise, hybrid search possible |
| Embeddings     | OpenAI                   | Quality, easy integration                  |
| LLM            | GPT-4                    | Domain-specific language competence        |
| Infrastructure | AWS (EC2, later managed) | Scalability                                |

The First Architecture: Docker on EC2

Our initial architecture looked like this:

Initial RAG architecture: Docker containers on AWS EC2 with GPU instance

The Problem: The GPU instance for embeddings ran around the clock, even when no documents were being processed -- a significant cost factor at current GPU prices.

The Optimized Architecture

After the first few months, we migrated to:

Optimized production architecture with Elastic Cloud, Lambda, and OpenAI

The Advantages:

  • Elastic Cloud instead of self-hosted Elasticsearch: Less ops overhead, automatic scaling
  • On-Demand Ingestion: Embedding pipeline runs only when needed, not 24/7
  • OpenAI API for Embeddings: No own GPU infrastructure needed

Multi-Format Processing: The Underestimated Complexity

The Format Chaos

Our document corpus consisted of:

| Format | Source          | Challenge                          |
|--------|-----------------|------------------------------------|
| PDF    | Technical books | Layout detection, tables, graphics |
| XML    | Article exports | Structured but deeply nested       |
| JSON   | API exports     | Different schemas                  |
| HTML   | Web content     | Inconsistent structures            |

Chunking Strategy: One Size Fits None

The most important insight: Different content types need different chunking strategies.

For prose text (books, articles):

Strategy: Semantic chunking
- Chunk size: 500-800 tokens
- Overlap: 100 tokens
- Split at paragraph boundaries

For structured data (XML, JSON):

Strategy: Structure-preserving
- Chunk at logical units (chapters, sections)
- Carry metadata along as context
- Preserve the hierarchy

For tables:

Strategy: Large chunks
- Tables as a whole or in logical parts
- HTML format to preserve structure
- Additional descriptive text for context
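For the prose case, the strategy above can be sketched in a few lines. A minimal, self-contained version of paragraph-boundary chunking with overlap -- whitespace-separated word counts stand in for real tokenizer counts, which a production pipeline would take from the embedding model's tokenizer:

```python
def chunk_text(text, max_tokens=800, overlap=100):
    """Split prose into chunks at paragraph boundaries with word-level overlap.

    Word counts approximate token counts; swap in the embedding model's
    tokenizer for production use."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Flush the current chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry the overlap into the next chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because splits happen only at paragraph boundaries, chunks land between 500 and 800 tokens in practice rather than at an exact size, which preserves complete thoughts.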

Tables: The Underestimated Opponent

The Problem

Pharmaceutical literature is packed with tables: drug interactions, dosage tables, lab values. These tables often contain the most valuable information -- but they are notoriously difficult for RAG systems to handle.

Typical Problems:

  • PDF extraction destroys table structure
  • Cell contents are incorrectly assigned
  • Column and row relationships are lost

Our Solution: LLM-Assisted Table Recognition

We developed a multi-step process:

LLM-assisted table extraction: From PDF to structured HTML table

Step 1: Layout Analysis
Identification of table regions on the page.

Step 2: LLM Structuring
A specialized prompt extracts the table structure:

Analyze this table and convert it to valid HTML.
Preserve all column and row relationships.
Add a short description of the table contents.

Step 3: Large Chunks
Tables are stored as large chunks (up to 2,000 tokens) to preserve context.
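Steps 2 and 3 can be sketched as a single function, with the LLM injected as a plain callable so the structure is testable -- `build_table_chunk`, the word-based token proxy, and the truncation fallback are illustrative, not the production wiring:

```python
TABLE_PROMPT = (
    "Analyze this table and convert it to valid HTML.\n"
    "Preserve all column and row relationships.\n"
    "Add a short description of the table contents."
)

def build_table_chunk(raw_table_text, llm, max_tokens=2000):
    """Ask the LLM to rebuild an extracted table region as HTML, then store
    it as one large chunk. `llm` is any callable(prompt) -> str; in
    production it would wrap a chat-completion call. Word count is a rough
    token proxy; oversized tables should really be split at row boundaries
    rather than truncated as done here for simplicity."""
    html = llm(TABLE_PROMPT + "\n\n" + raw_table_text)
    words = html.split()
    if len(words) > max_tokens:
        html = " ".join(words[:max_tokens])
    return {"type": "table", "content": html}
```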

The Results

| Metric                              | Before LLM Extraction | After LLM Extraction |
|-------------------------------------|-----------------------|----------------------|
| Structure preservation              | ~40%                  | ~90%                 |
| Correct cell assignment             | ~50%                  | ~85%                 |
| Retrieval quality for table queries | low                   | high                 |

Lesson Learned: The investment in high-quality table extraction pays off. For domain-specific content, tables are often the primary information source.

Latency: Two Worlds

The Latency Profile

Our measurements revealed an interesting pattern:

| Phase           | Latency   | Share |
|-----------------|-----------|-------|
| Query embedding | ~100ms    | 1%    |
| Vector search   | ~200ms    | 2%    |
| Reranking       | ~700ms    | 7%    |
| LLM generation  | ~10,000ms | 90%   |

The Insight: 90% of latency comes from the LLM call, not from retrieval.

Optimization Strategies

1. Streaming Responses

Instead of: waiting for the complete answer (10+ seconds)
Better: streaming from the first token (~500ms time-to-first-token)
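The consuming side can be sketched with the token stream injected as any iterable (for example, an OpenAI streaming response) -- `stream_answer` and the callback wiring are a hypothetical helper, not our exact implementation:

```python
import time

def stream_answer(token_stream, on_token):
    """Forward tokens to the UI as they arrive and measure time-to-first-token.

    `token_stream` is any iterable of text pieces (e.g. deltas from a
    streaming chat-completion response); `on_token` renders each piece
    immediately instead of waiting ~10 s for the full answer."""
    start = time.monotonic()
    ttft = None
    for token in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        on_token(token)
    return ttft
```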

2. Context Optimization

Less is more:
- Top-3 instead of top-10 documents
- Precise chunks instead of long passages
- A smaller context window = a faster answer
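Trimming the reranked results to a small top-k under a token budget is a few lines; a sketch with illustrative names, again approximating tokens by word count:

```python
def build_context(ranked_chunks, top_k=3, token_budget=1500):
    """Keep only the top-k reranked chunks, stopping once the approximate
    token budget is reached -- a smaller prompt means a faster LLM response."""
    context, used = [], 0
    for chunk in ranked_chunks[:top_k]:
        n = len(chunk.split())  # rough token proxy
        if used + n > token_budget:
            break
        context.append(chunk)
        used += n
    return "\n\n".join(context)
```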

3. Caching Strategies

- Embedding cache for frequent queries
- Response cache for identical queries
- Precompute document snippets
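The embedding cache can be sketched as a thin wrapper with lightweight query normalization, so "Wechselwirkung Ibuprofen" and " wechselwirkung ibuprofen " hit the same entry -- `embed_fn` stands in for the actual embeddings API call:

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings for repeated queries.

    `embed_fn` is any callable(text) -> vector; in production it would wrap
    the embeddings API. Queries are normalized (trimmed, lowercased) before
    hashing so trivially different spellings share one cache entry."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}

    def embed(self, text):
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```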

Cost Optimization: The Hard Lessons

The Cost Shock

After the first month in production, a sobering bill:

| Item                    | Monthly Cost |
|-------------------------|--------------|
| GPU instance (24/7)     | Significant  |
| OpenAI API (Embeddings) | Moderate     |
| OpenAI API (LLM)        | High         |
| Elasticsearch           | Moderate     |

The Optimizations

1. From 24/7 GPU to On-Demand

The biggest savings: ingestion pipeline only when needed.

Before: GPU instance runs permanently
After: Lambda trigger on new documents

Savings: >70% of compute costs
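The trigger side of that setup can be sketched as an S3-driven Lambda handler, assuming the standard S3 `ObjectCreated` event schema -- `ingest_document` is a placeholder for the real parse/chunk/embed/index pipeline:

```python
def ingest_document(bucket, key):
    """Placeholder: in production this downloads the file, chunks it,
    calls the embeddings API, and indexes into Elasticsearch."""
    pass

def handler(event, context):
    """AWS Lambda entry point for S3 ObjectCreated notifications.

    The embedding pipeline runs only when new documents arrive, instead of
    keeping compute warm 24/7. Field access follows the standard S3 event
    record schema."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        ingest_document(bucket, key)
        processed.append(key)
    return {"processed": processed}
```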

2. Managed Services vs. Self-Hosted

| Aspect              | Self-Hosted  | Managed     |
|---------------------|--------------|-------------|
| Initial costs       | Lower        | Higher      |
| Ops overhead        | High         | Minimal     |
| Scaling             | Manual       | Automatic   |
| Total cost (1 year) | Often higher | Often lower |

3. Reduce Embedding Costs

- Batch processing for new documents
- Incremental updates instead of full re-indexing
- Embedding cache for frequent queries
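The first two points combine naturally: filter out documents that are already indexed, then group only the new ones into API-sized batches. A minimal sketch with illustrative names:

```python
def batch_new_documents(doc_ids, indexed_ids, batch_size=100):
    """Incremental update: embed only documents not yet in the index,
    grouped into batches sized for a single embeddings API request."""
    indexed = set(indexed_ids)
    new = [d for d in doc_ids if d not in indexed]
    return [new[i:i + batch_size] for i in range(0, len(new), batch_size)]
```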

Monitoring: What Really Matters

The Most Important Metrics

Quality Metrics:

  • Retrieval precision (relevant documents in top-K?)
  • Answer relevance (does the answer address the question?)
  • Hallucination rate (are unsupported facts mentioned?)

Performance Metrics:

  • Time-to-First-Token (user experience)
  • End-to-end latency
  • Throughput (queries per second)

Business Metrics:

  • User satisfaction
  • Successful answers per day
  • Escalations to human experts

Alerting Setup

Critical:
- Latency > 20 seconds
- Error rate > 5%
- Elasticsearch unreachable

Warning:
- Latency > 15 seconds
- Declining retrieval quality
- Unusual query patterns
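The numeric thresholds map onto a simple health classifier; retrieval-quality drift and unusual query patterns need their own detectors and are left out of this sketch:

```python
def classify_health(latency_s, error_rate, es_reachable):
    """Map current measurements onto the alerting thresholds:
    critical beats warning; error_rate is a fraction (0.05 == 5%)."""
    if not es_reachable or latency_s > 20 or error_rate > 0.05:
        return "critical"
    if latency_s > 15:
        return "warning"
    return "ok"
```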

Lessons Learned

What We Would Do Differently

1. Adopt Managed Services Earlier
The self-hosted approach was educational, but the ops overhead was disproportionate to the benefit.

2. Spend More Time on Document Preparation
The quality of input data largely determines the quality of answers. Investment here pays off doubly.

3. Prioritize Tables from the Start
We initially treated tables as a "special case." But for many use cases, they were the core content.

4. Set Realistic Latency Expectations
10+ second response times are normal for complex LLM generations. Better to implement streaming than to attempt unrealistic optimizations.

What Worked Well

  • Hybrid search (vector + keyword) significantly improved retrieval quality
  • Chunk overlap reduced context loss at chunk boundaries
  • Strict source citations increased user trust
  • Feedback loop with domain experts continuously improved quality
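For the hybrid-search point: in Elasticsearch 8, a kNN clause and a BM25 `match` query can be combined in one search body, with their scores summed. A sketch of such a query body -- the `content` and `embedding` field names are illustrative:

```python
def hybrid_query(text, query_vector, k=10):
    """Build an Elasticsearch 8 hybrid search body: approximate kNN over
    the dense embedding field plus a BM25 match on the text field.
    Field names ("content", "embedding") are placeholders for your mapping."""
    return {
        "query": {"match": {"content": text}},
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 5 * k,  # widen the ANN candidate pool for recall
        },
        "size": k,
    }
```

The body would be passed as-is to the search endpoint, e.g. `es.search(index="corpus", body=hybrid_query(q, vec))`.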

Conclusion

The path from RAG PoC to production system requires decisions across many dimensions: infrastructure, data processing, cost optimization, and monitoring. The most important insight: There is no one-size-fits-all solution. Every document corpus, every domain, and every user base brings its own challenges.

What matters is an iterative approach: start with a solid foundational architecture, measure consistently, and optimize where it has the greatest impact. And never underestimate the complexity of document preparation -- it is often the key to success.


Planning a RAG system for your domain content? Contact me for a no-obligation consultation on architecture and implementation.