RAG in Production: From PoC to Enterprise System

The leap from a RAG prototype to a production system is bigger than many expect. In this article, I share my experiences from building a RAG system for a pharmaceutical publishing house -- with over 135 technical books and approximately 2,000 journal articles as the knowledge base.
The Starting Situation
A publishing house in the pharmaceutical sector faced a classic challenge: decades of valuable expertise, spread across various publication formats, needed to be made accessible through an AI-powered search.
The Requirements:
- 135 technical books (primarily PDF)
- ~2,000 journal articles (XML, JSON, HTML)
- Many complex tables with pharmaceutical data
- Target: 5,000-10,000 queries per day
- Response time under 15 seconds
What started as a manageable project turned into a lesson in enterprise RAG architecture.
Architecture Decisions
The Tech Stack
After evaluating various options, we chose:
| Component | Technology | Rationale |
|---|---|---|
| Vector Store | Elasticsearch | Existing expertise, hybrid search possible |
| Embeddings | OpenAI | Quality, easy integration |
| LLM | GPT-4 | Domain-specific language competence |
| Infrastructure | AWS (EC2, later managed) | Scalability |
The First Architecture: Docker on EC2
Our initial setup ran all components as Docker containers on EC2, including a dedicated GPU instance for the embedding pipeline.
The Problem: The GPU instance for embeddings ran permanently -- even when no documents were being processed. At current GPU prices, a significant cost factor.
The Optimized Architecture
After the first few months, we migrated to a managed, on-demand architecture.
The Advantages:
- Elastic Cloud instead of self-hosted Elasticsearch: Less ops overhead, automatic scaling
- On-Demand Ingestion: Embedding pipeline runs only when needed, not 24/7
- OpenAI API for Embeddings: No own GPU infrastructure needed
Multi-Format Processing: The Underestimated Complexity
The Format Chaos
Our document corpus consisted of:
| Format | Source | Challenge |
|---|---|---|
| PDF | Technical books | Layout detection, tables, graphics |
| XML | Article exports | Structured but deeply nested |
| JSON | API exports | Different schemas |
| HTML | Web content | Inconsistent structures |
Chunking Strategy: One Size Fits None
The most important insight: Different content types need different chunking strategies.
For prose text (books, articles):
Strategy: Semantic chunking
- Chunk size: 500-800 tokens
- Overlap: 100 tokens
- Split at paragraph boundaries
For structured data (XML, JSON):
Strategy: Structure-preserving
- Chunk at logical units (chapters, sections)
- Carry metadata along as context
- Preserve hierarchy
For tables:
Strategy: Large chunks
- Tables as a whole or in logical parts
- HTML format to preserve structure
- Additional descriptive text for context
Tables: The Underestimated Opponent
The Problem
Pharmaceutical literature is packed with tables: drug interactions, dosage tables, lab values. These tables often contain the most valuable information -- but they are notoriously difficult for RAG systems to handle.
Typical Problems:
- PDF extraction destroys table structure
- Cell contents are incorrectly assigned
- Column and row relationships are lost
Our Solution: LLM-Assisted Table Recognition
We developed a multi-step process:
Step 1: Layout Analysis
Identification of table regions on the page.
Step 2: LLM Structuring
A specialized prompt extracts the table structure:
Analyze this table and convert it to valid HTML.
Preserve all column and row relationships.
Add a short description of the table's contents.
Step 3: Large Chunks
Tables are stored as large chunks (up to 2,000 tokens) to preserve context.
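Step 2 can be sketched roughly as follows. The `llm` callable and the sanity check are illustrative assumptions, not the exact production code; injecting the model as a plain callable keeps the step testable without API access.

```python
TABLE_PROMPT = """Analyze this table and convert it to valid HTML.
Preserve all column and row relationships.
Add a short description of the table's contents.

Table:
{table}"""

def table_to_html(raw_table, llm):
    """llm is any callable prompt -> str, e.g. a thin wrapper
    around a GPT-4 chat completion."""
    html = llm(TABLE_PROMPT.format(table=raw_table))
    # Basic sanity check before the result is indexed as a chunk
    if "<table" not in html.lower():
        raise ValueError("LLM response contains no HTML table")
    return html
```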
The Results
| Metric | Before LLM Extraction | After LLM Extraction |
|---|---|---|
| Structure preservation | ~40% | ~90% |
| Correct cell assignment | ~50% | ~85% |
| Retrieval quality for table queries | low | high |
Lesson Learned: The investment in high-quality table extraction pays off. For domain-specific content, tables are often the primary information source.
Latency: Two Worlds
The Latency Profile
Our measurements revealed an interesting pattern:
| Phase | Latency | Share |
|---|---|---|
| Query embedding | ~100ms | 1% |
| Vector search | ~200ms | 2% |
| Reranking | ~700ms | 7% |
| LLM generation | ~10,000ms | 90% |
The Insight: 90% of latency comes from the LLM call, not from retrieval.
Optimization Strategies
1. Streaming Responses
Instead of: waiting for the complete answer (10+ seconds)
Better: streaming from the first token (~500ms time-to-first-token)
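A minimal sketch of the streaming pattern, with the token source injected so it works with any streaming LLM client (e.g. the deltas of a streaming chat-completion response):

```python
import time

def stream_answer(token_iter, on_token=lambda tok: None):
    """Consume an LLM token stream as it arrives instead of
    waiting for the complete answer."""
    start = time.monotonic()
    first_token_at = None
    parts = []
    for tok in token_iter:
        if first_token_at is None:
            # Time-to-first-token: what the user actually perceives
            first_token_at = time.monotonic() - start
        parts.append(tok)
        on_token(tok)  # e.g. push to the client via SSE/WebSocket
    return "".join(parts), first_token_at
```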
2. Context Optimization
Less is more:
- Top-3 instead of top-10 documents
- Precise chunks instead of long passages
- A smaller context window = a faster answer
3. Caching Strategies
- Embedding cache for frequent queries
- Response cache for identical queries
- Precompute document snippets
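The response cache can be as simple as a lookup keyed by a normalized query, so trivially different spellings of the same question hit the same entry. This in-memory sketch stands in for what would typically be Redis with a TTL:

```python
import hashlib

class ResponseCache:
    """In-memory response cache keyed by a normalized query string."""

    def __init__(self):
        self._store = {}

    def _key(self, query):
        # Lowercase and collapse whitespace before hashing
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, answer):
        self._store[self._key(query)] = answer
```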
Cost Optimization: The Hard Lessons
The Cost Shock
After the first month in production, a sobering bill:
| Item | Monthly Cost |
|---|---|
| GPU instance (24/7) | Significant |
| OpenAI API (Embeddings) | Moderate |
| OpenAI API (LLM) | High |
| Elasticsearch | Moderate |
The Optimizations
1. From 24/7 GPU to On-Demand
The biggest savings: ingestion pipeline only when needed.
Before: GPU instance runs permanently
After: Lambda trigger on new documents
Savings: >70% of compute costs
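The on-demand trigger can be sketched as an AWS Lambda handler. The event shape follows the standard S3 notification format; the actual pipeline (parse, chunk, embed, index) is injected as a plain callable and is a no-op in this sketch.

```python
def handler(event, context, ingest=None):
    """S3-triggered entry point: run the ingestion pipeline only
    when new documents arrive, instead of keeping compute on 24/7."""
    ingest = ingest or (lambda key: None)  # real pipeline injected in production
    # Standard S3 event notification: one record per uploaded object
    keys = [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]
    for key in keys:
        ingest(key)
    return {"processed": len(keys)}
```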
2. Managed Services vs. Self-Hosted
| Aspect | Self-Hosted | Managed |
|---|---|---|
| Initial costs | Lower | Higher |
| Ops overhead | High | Minimal |
| Scaling | Manual | Automatic |
| Total cost (1 year) | Often higher | Often lower |
3. Reduce Embedding Costs
- Batch processing for new documents
- Incremental updates instead of full re-indexing
- Embedding cache for frequent queries
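Incremental updates boil down to comparing content hashes against what is already indexed and re-embedding only the documents that changed. A minimal sketch; the state store is a plain dict here, while in production it would live alongside the index:

```python
import hashlib

def docs_to_reembed(docs, index_state):
    """Return the ids of documents that need (re-)embedding.

    docs: {doc_id: text} of the current corpus.
    index_state: {doc_id: content_hash} of what is already
    embedded; updated in place.
    """
    changed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            index_state[doc_id] = digest
    return changed
```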
Monitoring: What Really Matters
The Most Important Metrics
Quality Metrics:
- Retrieval precision (relevant documents in top-K?)
- Answer relevance (does the answer address the question?)
- Hallucination rate (are unsupported facts mentioned?)
Performance Metrics:
- Time-to-First-Token (user experience)
- End-to-end latency
- Throughput (queries per second)
Business Metrics:
- User satisfaction
- Successful answers per day
- Escalations to human experts
Alerting Setup
Critical:
- Latency > 20 seconds
- Error rate > 5%
- Elasticsearch unreachable
Warning:
- Latency > 15 seconds
- Declining retrieval quality
- Unusual query patterns
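The hard thresholds translate directly into a check like this; the softer signals (declining retrieval quality, unusual query patterns) need their own detectors and are omitted from the sketch:

```python
def alert_level(latency_s, error_rate, es_reachable=True):
    """Map current metrics to an alert level.

    Thresholds: critical at >20s latency, >5% errors, or an
    unreachable Elasticsearch; warning at >15s latency.
    """
    if not es_reachable or latency_s > 20 or error_rate > 0.05:
        return "critical"
    if latency_s > 15:
        return "warning"
    return "ok"
```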
Lessons Learned
What We Would Do Differently
1. Adopt Managed Services Earlier
The self-hosted approach was educational, but the ops overhead was disproportionate to the benefit.
2. Spend More Time on Document Preparation
The quality of input data largely determines the quality of answers. Investment here pays off doubly.
3. Prioritize Tables from the Start
We initially treated tables as a "special case." But for many use cases, they were the core content.
4. Set Realistic Latency Expectations
10+ second response times are normal for complex LLM generations. Better to implement streaming than to attempt unrealistic optimizations.
What Worked Well
- Hybrid search (vector + keyword) significantly improved retrieval quality
- Chunk overlap reduced context loss at chunk boundaries
- Strict source citations increased user trust
- Feedback loop with domain experts continuously improved quality
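The hybrid search mentioned above can be expressed as a single Elasticsearch 8.x request that combines a BM25 `match` query with kNN vector search. Field names and boost values here are illustrative assumptions, not our production configuration:

```python
def hybrid_query(query_text, query_vector, k=3):
    """Build an Elasticsearch request body that blends keyword
    (BM25) and vector (kNN) scores for the same query."""
    return {
        # Keyword leg: classic BM25 over the chunk text
        "query": {"match": {"text": {"query": query_text, "boost": 0.3}}},
        # Vector leg: approximate kNN over the embedding field
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 50,
            "boost": 0.7,
        },
        "size": k,
    }
```

The body would be passed to the search API, e.g. `es.search(index="chunks", body=hybrid_query(q, vec))`.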
Conclusion
The path from RAG PoC to production system requires decisions across many dimensions: infrastructure, data processing, cost optimization, and monitoring. The most important insight: There is no one-size-fits-all solution. Every document corpus, every domain, and every user base brings its own challenges.
What matters is an iterative approach: start with a solid foundational architecture, measure consistently, and optimize where it has the greatest impact. And never underestimate the complexity of document preparation -- it is often the key to success.
Planning a RAG system for your domain content? Contact me for a no-obligation consultation on architecture and implementation.