RAG in Production: From PoC to Enterprise System

The leap from a RAG prototype to a production system is bigger than many expect. In this article, I share my experiences from building a RAG system for a pharmaceutical publishing house -- with over 135 technical books and approximately 2,000 journal articles as the knowledge base.
The Starting Situation
A publishing house in the pharmaceutical sector faced a classic challenge: decades of valuable expertise, spread across various publication formats, needed to be made accessible through an AI-powered search.
The Requirements:
- 135 technical books (primarily PDF)
- ~2,000 journal articles (XML, JSON, HTML)
- Many complex tables with pharmaceutical data
- Target: 5,000-10,000 queries per day
- Response time under 15 seconds
What started as a manageable project turned into a lesson in enterprise RAG architecture.
Architecture Decisions
The Tech Stack
After evaluating various options, we chose:
| Component | Technology | Rationale |
|---|---|---|
| Vector Store | Elasticsearch | Existing expertise, hybrid search possible |
| Embeddings | OpenAI | Quality, easy integration |
| LLM | GPT-4 | Domain-specific language competence |
| Infrastructure | AWS (EC2, later managed) | Scalability |
The First Architecture: Docker on EC2
Our initial setup ran all components as Docker containers on EC2, including a dedicated GPU instance for the embedding pipeline.
The Problem: The GPU instance for embeddings ran permanently -- even when no documents were being processed. At current GPU prices, a significant cost factor.
The Optimized Architecture
After the first few months, we migrated to a managed, on-demand architecture.
The Advantages:
- Elastic Cloud instead of self-hosted Elasticsearch: Less ops overhead, automatic scaling
- On-Demand Ingestion: Embedding pipeline runs only when needed, not 24/7
- OpenAI API for Embeddings: No own GPU infrastructure needed
Multi-Format Processing: The Underestimated Complexity
The Format Chaos
Our document corpus consisted of:
| Format | Source | Challenge |
|---|---|---|
| PDF | Technical books | Layout detection, tables, graphics |
| XML | Article exports | Structured but deeply nested |
| JSON | API exports | Different schemas |
| HTML | Web content | Inconsistent structures |
Chunking Strategy: One Size Fits None
The most important insight: Different content types need different chunking strategies.
For prose text (books, articles):
Strategy: Semantic chunking
- Chunk size: 500-800 tokens
- Overlap: 100 tokens
- Split at paragraph boundaries
For structured data (XML, JSON):
Strategy: Structure-preserving
- Chunk at logical units (chapters, sections)
- Carry metadata along as context
- Preserve hierarchy
For tables:
Strategy: Large chunks
- Tables as a whole or in logical parts
- HTML format to preserve structure
- Additional descriptive text for context
Tables: The Underestimated Opponent
The Problem
Pharmaceutical literature is packed with tables: drug interactions, dosage tables, lab values. These tables often contain the most valuable information -- but they are notoriously difficult for RAG systems to handle.
Typical Problems:
- PDF extraction destroys table structure
- Cell contents are incorrectly assigned
- Column and row relationships are lost
Our Solution: LLM-Assisted Table Recognition
We developed a multi-step process:
Step 1: Layout Analysis
Identification of table regions on the page.
Step 2: LLM Structuring
A specialized prompt extracts the table structure:
Analyze this table and convert it to valid HTML.
Preserve all column and row relationships.
Add a short description of the table's contents.
Step 3: Large Chunks
Tables are stored as large chunks (up to 2,000 tokens) to preserve context.
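Step 2 can be sketched roughly as follows. The `llm` callable and the sanity check are illustrative assumptions, not the exact production code; injecting the model as a plain callable keeps the step testable without API access.

```python
TABLE_PROMPT = """Analyze this table and convert it to valid HTML.
Preserve all column and row relationships.
Add a short description of the table's contents.

Table:
{table}"""

def table_to_html(raw_table, llm):
    """llm is any callable prompt -> str, e.g. a thin wrapper
    around a GPT-4 chat completion."""
    html = llm(TABLE_PROMPT.format(table=raw_table))
    # Basic sanity check before the result is indexed as a chunk
    if "<table" not in html.lower():
        raise ValueError("LLM response contains no HTML table")
    return html
```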
The Results
| Metric | Before LLM Extraction | After LLM Extraction |
|---|---|---|
| Structure preservation | ~40% | ~90% |
| Correct cell assignment | ~50% | ~85% |
| Retrieval quality for table queries | low | high |
Lesson Learned: The investment in high-quality table extraction pays off. For domain-specific content, tables are often the primary information source.
Latency: Two Worlds
The Latency Profile
Our measurements revealed an interesting pattern:
| Phase | Latency | Share |
|---|---|---|
| Query embedding | ~100ms | 1% |
| Vector search | ~200ms | 2% |
| Reranking | ~700ms | 7% |
| LLM generation | ~10,000ms | 90% |
The Insight: 90% of latency comes from the LLM call, not from retrieval.
Optimization Strategies
1. Streaming Responses
Instead of: waiting for the complete answer (10+ seconds)
Better: streaming from the first token (~500ms time-to-first-token)
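A minimal sketch of the streaming pattern, with the token source injected so it works with any streaming LLM client (e.g. the deltas of a streaming chat-completion response):

```python
import time

def stream_answer(token_iter, on_token=lambda tok: None):
    """Consume an LLM token stream as it arrives instead of
    waiting for the complete answer."""
    start = time.monotonic()
    first_token_at = None
    parts = []
    for tok in token_iter:
        if first_token_at is None:
            # Time-to-first-token: what the user actually perceives
            first_token_at = time.monotonic() - start
        parts.append(tok)
        on_token(tok)  # e.g. push to the client via SSE/WebSocket
    return "".join(parts), first_token_at
```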
2. Context Optimization
Less is more:
- Top-3 instead of top-10 documents
- Precise chunks instead of long passages
- A smaller context window = a faster answer
3. Caching Strategies
- Embedding cache for frequent queries
- Response cache for identical queries
- Precompute document snippets
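The response cache can be as simple as a lookup keyed by a normalized query, so trivially different spellings of the same question hit the same entry. This in-memory sketch stands in for what would typically be Redis with a TTL:

```python
import hashlib

class ResponseCache:
    """In-memory response cache keyed by a normalized query string."""

    def __init__(self):
        self._store = {}

    def _key(self, query):
        # Lowercase and collapse whitespace before hashing
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        return self._store.get(self._key(query))

    def put(self, query, answer):
        self._store[self._key(query)] = answer
```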
Cost Optimization: The Hard Lessons
The Cost Shock
After the first month in production, a sobering bill:
| Item | Monthly Cost |
|---|---|
| GPU instance (24/7) | Significant |
| OpenAI API (Embeddings) | Moderate |
| OpenAI API (LLM) | High |
| Elasticsearch | Moderate |
The Optimizations
1. From 24/7 GPU to On-Demand
The biggest savings: ingestion pipeline only when needed.
Before: GPU instance runs permanently
After: Lambda trigger on new documents
Savings: >70% of compute costs
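The on-demand trigger can be sketched as an AWS Lambda handler. The event shape follows the standard S3 notification format; the actual pipeline (parse, chunk, embed, index) is injected as a plain callable and is a no-op in this sketch.

```python
def handler(event, context, ingest=None):
    """S3-triggered entry point: run the ingestion pipeline only
    when new documents arrive, instead of keeping compute on 24/7."""
    ingest = ingest or (lambda key: None)  # real pipeline injected in production
    # Standard S3 event notification: one record per uploaded object
    keys = [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]
    for key in keys:
        ingest(key)
    return {"processed": len(keys)}
```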
2. Managed Services vs. Self-Hosted
| Aspect | Self-Hosted | Managed |
|---|---|---|
| Initial costs | Lower | Higher |
| Ops overhead | High | Minimal |
| Scaling | Manual | Automatic |
| Total cost (1 year) | Often higher | Often lower |
3. Reduce Embedding Costs
- Batch processing for new documents
- Incremental updates instead of full re-indexing
- Embedding cache for frequent queries
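Incremental updates boil down to comparing content hashes against what is already indexed and re-embedding only the documents that changed. A minimal sketch; the state store is a plain dict here, while in production it would live alongside the index:

```python
import hashlib

def docs_to_reembed(docs, index_state):
    """Return the ids of documents that need (re-)embedding.

    docs: {doc_id: text} of the current corpus.
    index_state: {doc_id: content_hash} of what is already
    embedded; updated in place.
    """
    changed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            index_state[doc_id] = digest
    return changed
```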
Monitoring: What Really Matters
The Most Important Metrics
Quality Metrics:
- Retrieval precision (relevant documents in top-K?)
- Answer relevance (does the answer address the question?)
- Hallucination rate (are unsupported facts mentioned?)
Performance Metrics:
- Time-to-First-Token (user experience)
- End-to-end latency
- Throughput (queries per second)
Business Metrics:
- User satisfaction
- Successful answers per day
- Escalations to human experts
Alerting Setup
Critical:
- Latency > 20 seconds
- Error rate > 5%
- Elasticsearch unreachable
Warning:
- Latency > 15 seconds
- Declining retrieval quality
- Unusual query patterns
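The hard thresholds translate directly into a check like this; the softer signals (declining retrieval quality, unusual query patterns) need their own detectors and are omitted from the sketch:

```python
def alert_level(latency_s, error_rate, es_reachable=True):
    """Map current metrics to an alert level.

    Thresholds: critical at >20s latency, >5% errors, or an
    unreachable Elasticsearch; warning at >15s latency.
    """
    if not es_reachable or latency_s > 20 or error_rate > 0.05:
        return "critical"
    if latency_s > 15:
        return "warning"
    return "ok"
```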
Lessons Learned
What We Would Do Differently
1. Adopt Managed Services Earlier
The self-hosted approach was educational, but the ops overhead was disproportionate to the benefit.
2. Spend More Time on Document Preparation
The quality of input data largely determines the quality of answers. Investment here pays off doubly.
3. Prioritize Tables from the Start
We initially treated tables as a "special case." But for many use cases, they were the core content.
4. Set Realistic Latency Expectations
10+ second response times are normal for complex LLM generations. Better to implement streaming than to attempt unrealistic optimizations.
What Worked Well
- Hybrid search (vector + keyword) significantly improved retrieval quality
- Chunk overlap reduced context loss at chunk boundaries
- Strict source citations increased user trust
- Feedback loop with domain experts continuously improved quality
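The hybrid search mentioned above can be expressed as a single Elasticsearch 8.x request that combines a BM25 `match` query with kNN vector search. Field names and boost values here are illustrative assumptions, not our production configuration:

```python
def hybrid_query(query_text, query_vector, k=3):
    """Build an Elasticsearch request body that blends keyword
    (BM25) and vector (kNN) scores for the same query."""
    return {
        # Keyword leg: classic BM25 over the chunk text
        "query": {"match": {"text": {"query": query_text, "boost": 0.3}}},
        # Vector leg: approximate kNN over the embedding field
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 50,
            "boost": 0.7,
        },
        "size": k,
    }
```

The body would be passed to the search API, e.g. `es.search(index="chunks", body=hybrid_query(q, vec))`.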
Conclusion
The path from RAG PoC to production system requires decisions across many dimensions: infrastructure, data processing, cost optimization, and monitoring. The most important insight: There is no one-size-fits-all solution. Every document corpus, every domain, and every user base brings its own challenges.
What matters is an iterative approach: start with a solid foundational architecture, measure consistently, and optimize where it has the greatest impact. And never underestimate the complexity of document preparation -- it is often the key to success.
Planning a RAG system for your domain content? Contact me for a no-obligation consultation on architecture and implementation.