RAG Evaluation: Metrics and Testing for AI Systems

You've built a RAG system and the first answers look promising. But how do you know whether it really works well? "Looks good" isn't enough when your system answers customer inquiries in production or delivers medical information.

In my work with RAG systems in production, I've learned: without systematic evaluation, you're flying blind. This article shows you how to test RAG systems with proven frameworks and metrics, from initial evaluation to CI/CD integration.

Why RAG Evaluation Is Different

Traditional software tests check deterministic outputs: assert calculate_tax(100) == 19. With RAG systems, this is fundamentally different:

  • Non-deterministic outputs: The same query can produce different but equally correct answers
  • Two sources of error: Problems can lie in retrieval or in generation
  • Semantic correctness: "Berlin is the capital" and "The capital of Germany is Berlin" are both correct. A string comparison fails here

The RAG Triad

TruLens developed an elegant model for RAG quality: the RAG Triad. It checks three dimensions that together cover all sources of error:

The RAG Triad: Context Relevance, Groundedness, and Answer Relevance as three pillars of quality assurance
Dimension | Checks | Failure Case
Context Relevance | Are the retrieved chunks relevant to the question? | Irrelevant context → hallucination risk
Groundedness | Is the answer based on the retrieved context? | Fabricated facts, not supported by sources
Answer Relevance | Does the answer actually address the question? | Correct info, but off-topic

When all three dimensions score well, your RAG system is demonstrably free of hallucinations, up to the limits of your knowledge base.

The Four Core Metrics

The open-source frameworks RAGAS and DeepEval have established themselves as the standard. Both work with four core metrics:

Faithfulness

What is measured? How factually accurate is the generated answer compared to the retrieved context?

How does it work?

  1. The answer is broken down into individual claims
  2. Each claim is checked against the context (NLI-based)
  3. The score is the ratio of supported claims to total claims
Faithfulness = Supported Claims / Total Claims

Example:

  • Context: "Einstein was born on March 14, 1879 in Germany"
  • Answer A: "Einstein was born on March 14, 1879 in Germany" → 1.0 (all claims supported)
  • Answer B: "Einstein was born on March 20, 1879 in Germany" → 0.5 (date not supported)
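Once the per-claim verdicts are in, the score reduces to a simple ratio. A minimal sketch (the NLI verification step is assumed to have already produced a boolean verdict per claim):

```python
def faithfulness(claim_verdicts: list[bool]) -> float:
    """Ratio of context-supported claims to total claims."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Answer B from the example: birthplace supported, date not
print(faithfulness([True, False]))  # 0.5
```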

Answer Relevancy

What is measured? How relevant is the answer to the question asked?

High faithfulness alone is not enough. The answer must also address the actual question. A system could deliver factually accurate but completely irrelevant information.

Example:

  • Question: "What happens if the shoes don't fit?"
  • Answer A: "We offer a 30-day return policy at no extra cost." → High relevance
  • Answer B: "Our shoes come in sizes 36-48." → Low relevance (doesn't answer the question)
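RAGAS implements this by having an LLM generate questions back from the answer and comparing them with the original question. A minimal sketch of the scoring step, assuming the embedding vectors have already been computed:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevancy(question_vec: list[float],
                     generated_question_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and the
    questions an LLM reverse-engineered from the answer."""
    sims = [cosine(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)
```

An off-topic answer produces reverse-engineered questions that are dissimilar to the original question, so the mean similarity drops.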

Context Precision

What is measured? Are relevant chunks ranked higher than irrelevant ones?

This metric evaluates the quality of your retriever and reranker. An irrelevant chunk at position 1 drags the score down sharply, while the same chunk further down the ranking costs far less.

Context Precision@K = Σ(Precision@k × v_k) / Relevant Items in Top-K
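The formula can be traced with a short sketch, where relevance[i] marks whether the chunk at rank i+1 is relevant (a simplified reading of the formula above):

```python
def context_precision_at_k(relevance: list[bool]) -> float:
    """Rank-weighted precision: sum precision@k at each relevant rank k,
    divided by the number of relevant items in the top K."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / total_relevant

# Irrelevant chunk at rank 1, relevant chunks at ranks 2 and 3:
print(round(context_precision_at_k([False, True, True]), 2))  # 0.58
```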

Context Recall

What is measured? Were all relevant pieces of information retrieved?

Context Recall compares the facts in the expected answer (ground truth) with the retrieved chunks. Requires a reference answer.

Metric | Evaluates | Requires Ground Truth
Faithfulness | Generation | No
Answer Relevancy | Generation | No
Context Precision | Retrieval | Yes
Context Recall | Retrieval | Yes

RAGAS in Practice

RAGAS (Retrieval-Augmented Generation Assessment) is the most widely used open-source framework for RAG evaluation. Installation:

pip install ragas langchain-openai langchain-community

First Evaluation Example

from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
 
# Configure the evaluator LLM and embeddings
evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
evaluator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)
 
# Create test data
samples = [
    SingleTurnSample(
        user_input="Was ist Retrieval-Augmented Generation?",
        retrieved_contexts=[
            "RAG kombiniert Informationsabruf mit Textgenerierung. "
            "Relevante Dokumente werden aus einer Wissensbasis abgerufen "
            "und dem LLM als Kontext übergeben.",
            "RAG wurde 2020 von Meta AI vorgestellt und hat sich "
            "als Standard für wissensbasierte KI-Systeme etabliert.",
        ],
        response=(
            "RAG (Retrieval-Augmented Generation) ist eine Technik, "
            "die Informationsabruf mit Textgenerierung kombiniert. "
            "Dabei werden relevante Dokumente aus einer Wissensbasis "
            "abgerufen und einem LLM als Kontext übergeben."
        ),
        reference=(
            "RAG kombiniert Dokumentenabruf mit LLM-Generierung, "
            "um Antworten auf Basis externer Wissensquellen zu liefern."
        ),
    ),
]
 
dataset = EvaluationDataset(samples=samples)
 
# Run the evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=evaluator_llm),
        AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings),
        ContextPrecision(llm=evaluator_llm),
        ContextRecall(llm=evaluator_llm),
    ],
)
 
# Display the results
df = results.to_pandas()
print(df[[
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
]])

Typical output:

faithfulness | answer_relevancy | context_precision | context_recall
1.00 | 0.92 | 1.00 | 0.85

Interpreting the Results

Score Range | Rating | Action
0.8 to 1.0 | Good | Continue monitoring
0.6 to 0.8 | Acceptable | Optimization recommended
< 0.6 | Critical | Immediate action required

Low faithfulness? → Your LLM is hallucinating. Check the prompt template, reduce the temperature, implement guardrails.

Low context precision? → Your retriever is returning irrelevant chunks. Improve embedding quality or add a reranker.

Low context recall? → Relevant information is not being found. Review your chunking strategy and embedding coverage.
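For dashboards or alerting, these interpretation bands can be encoded directly; a minimal sketch (rate_score is an illustrative helper):

```python
def rate_score(score: float) -> str:
    """Map a metric score to the interpretation bands above."""
    if score >= 0.8:
        return "good: continue monitoring"
    if score >= 0.6:
        return "acceptable: optimization recommended"
    return "critical: immediate action required"

print(rate_score(0.92))  # good: continue monitoring
```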

Generating Synthetic Test Data

Manually writing hundreds of test questions is labor-intensive. RAGAS can generate synthetic test data directly from your documents:

from langchain_community.document_loaders import DirectoryLoader
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
 
# Load the documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()
 
# Configure the generator
generator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o")
)
generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)
 
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)
 
# Generate a test set from the documents
testset = generator.generate_with_langchain_docs(
    documents=docs,
    testset_size=50,
)
 
# Export as CSV for versioning
df = testset.to_pandas()
df.to_csv("tests/golden_dataset.csv", index=False)
print(f"Generated {len(df)} test questions")

The generator creates different question types: simple factual questions, multi-hop questions (requiring multiple chunks), and reasoning questions. This produces a well-balanced test set.

DeepEval: pytest for LLMs

While RAGAS excels at ad-hoc evaluation, DeepEval shines when integrating into existing test workflows. It works like pytest, but for LLMs.

pip install deepeval

Defining Test Cases

# tests/test_rag_quality.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
)
 
# Define metrics with thresholds
faithfulness = FaithfulnessMetric(threshold=0.8)
relevancy = AnswerRelevancyMetric(threshold=0.7)
precision = ContextualPrecisionMetric(threshold=0.65)
 
# Test data
TEST_CASES = [
    {
        "input": "Was ist der Unterschied zwischen RAG und Fine-Tuning?",
        "expected": "RAG ruft externe Dokumente ab, Fine-Tuning passt Modellgewichte an.",
    },
    {
        "input": "Welche Metriken gibt es für RAG-Evaluation?",
        "expected": "Faithfulness, Answer Relevancy, Context Precision und Context Recall.",
    },
]
 
def get_rag_response(query: str) -> tuple[str, list[str]]:
    """Call your RAG system."""
    # Plug in your actual RAG system here
    from your_rag_system import query_rag
    result = query_rag(query)
    return result["answer"], result["source_chunks"]
 
 
@pytest.mark.parametrize("test_data", TEST_CASES)
def test_rag_pipeline(test_data):
    actual_output, retrieval_context = get_rag_response(test_data["input"])
 
    test_case = LLMTestCase(
        input=test_data["input"],
        actual_output=actual_output,
        retrieval_context=retrieval_context,
        expected_output=test_data["expected"],
    )
 
    assert_test(test_case, [faithfulness, relevancy, precision])

Run with:

deepeval test run tests/test_rag_quality.py

CI/CD Integration

DeepEval integrates seamlessly with GitHub Actions:

# .github/workflows/rag-eval.yml
name: RAG Evaluation
 
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Python Setup
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
 
      - name: Install Dependencies
        run: pip install deepeval -r requirements.txt
 
      - name: RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/test_rag_quality.py

Every push or pull request automatically triggers the RAG evaluation. When metrics fall below defined thresholds, the build fails — just like failing unit tests.

Building Golden Datasets

The quality of your evaluation depends entirely on your test data. A Golden Dataset is a curated, versioned collection of questions with expected answers and relevant context.

Three-Tier Strategy

Tier | Source | Effort | Quality
Silver | RAGAS TestsetGenerator | Low | Medium
Gold | Validated by domain experts | Medium | High
Production | Annotated real user queries | High | Very high

Recommendation: Start with 50 silver test cases, have them validated by a domain expert, and continuously supplement them with real production queries.

Structure and Versioning

[
  {
    "query": "Was ist der Unterschied zwischen RAG und Fine-Tuning?",
    "ground_truth": "RAG ruft externe Dokumente ab und nutzt sie als Kontext. Fine-Tuning passt die Modellgewichte an.",
    "relevant_doc_ids": ["doc_042", "doc_117"],
    "question_type": "comparison",
    "difficulty": "medium",
    "annotator": "domain_expert_1",
    "annotation_date": "2026-02-15"
  }
]

Version your golden dataset alongside your code in the Git repository. This way, every commit makes it traceable which test data was used.

LLM-as-a-Judge

Why not simply use BLEU or ROUGE? These token-overlap metrics correlate poorly with human evaluation. LLM-based evaluation achieves 15 to 20% higher correlation because it understands semantic equivalence.

Custom Evaluators

Sometimes standard metrics aren't enough. For domain-specific requirements, you can build custom evaluators:

import json
from openai import AsyncOpenAI
 
client = AsyncOpenAI()
 
JUDGE_PROMPT = """Du bist ein Evaluierungsexperte für RAG-Systeme.
Bewerte die folgende Antwort auf einer Skala von 1-5.
 
Frage: {query}
Kontext: {context}
Antwort: {response}
 
Kriterien:
- Treue: Ist die Antwort durch den Kontext belegt?
- Relevanz: Beantwortet die Antwort die Frage?
- Vollständigkeit: Sind alle relevanten Informationen enthalten?
 
Antworte in JSON-Format:
{{"treue": <1-5>, "relevanz": <1-5>, "vollstaendigkeit": <1-5>, "begruendung": "<Text>"}}"""
 
 
async def llm_judge(query: str, context: str, response: str) -> dict:
    result = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, context=context, response=response
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    scores = json.loads(result.choices[0].message.content)
    return {
        "treue": scores["treue"] / 5.0,
        "relevanz": scores["relevanz"] / 5.0,
        "vollstaendigkeit": scores["vollstaendigkeit"] / 5.0,
        "begruendung": scores["begruendung"],
    }

This custom judge evaluates in German, with domain-specific criteria, and provides a traceable justification.

RAG Testing in the CI/CD Pipeline

RAG Evaluation Pipeline: From query through retrieval and generation to metric assessment

A production-ready evaluation pipeline combines all the building blocks discussed so far:

# tests/conftest.py
import pytest
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
)
 
# Quality gates: the pipeline fails when scores fall below these
THRESHOLDS = {
    "faithfulness": 0.80,
    "relevancy": 0.70,
    "precision": 0.65,
}
 
@pytest.fixture(scope="session")
def rag_metrics():
    return [
        FaithfulnessMetric(threshold=THRESHOLDS["faithfulness"]),
        AnswerRelevancyMetric(threshold=THRESHOLDS["relevancy"]),
        ContextualPrecisionMetric(threshold=THRESHOLDS["precision"]),
    ]

Recommended starting thresholds:

Metric | Minimum | Target | Critical
Faithfulness | 0.80 | 0.90 | < 0.70
Answer Relevancy | 0.70 | 0.85 | < 0.60
Context Precision | 0.65 | 0.80 | < 0.50
Context Recall | 0.60 | 0.80 | < 0.45

Start conservatively and gradually tighten the thresholds as your system matures.
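The minimum column translates into a simple gate check (MINIMUMS and failing_metrics are illustrative names, not part of any framework):

```python
# Minimum thresholds from the table above
MINIMUMS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.70,
    "context_precision": 0.65,
    "context_recall": 0.60,
}

def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return every metric that falls below its minimum threshold."""
    return [m for m, minimum in MINIMUMS.items()
            if scores.get(m, 0.0) < minimum]

print(failing_metrics({"faithfulness": 0.85, "answer_relevancy": 0.65,
                       "context_precision": 0.70, "context_recall": 0.62}))
# ['answer_relevancy']
```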

Framework Comparison

Which framework is right for you?

Framework | Strength | Ideal For
RAGAS | Research-oriented, flexible metrics | Custom evaluation, research
DeepEval | pytest integration, CI/CD | Engineering teams, automation
LangSmith | Tracing + debugging | LangChain users, observability
Arize Phoenix | Framework-agnostic, OpenTelemetry | Heterogeneous tech stacks
TruLens | RAG Triad methodology | Hallucination detection

My recommendation: RAGAS for initial evaluation, DeepEval for CI/CD. Both are open source and complement each other perfectly.

Conclusion

RAG evaluation is not an optional extra — it is the quality assurance of your AI application. With the right tools and metrics, it becomes as natural as unit tests in traditional software development.

Getting started in 3 steps:

  1. Understand the metrics: Faithfulness and Answer Relevancy are your most important indicators
  2. Build a golden dataset: 50 validated test questions are better than 500 unmaintained ones
  3. Integrate CI/CD: Automatic evaluation with every deployment

The frameworks presented here, especially RAGAS and DeepEval, make getting started easy. Both can be integrated into existing projects with just a few lines of code.

If you're building RAG systems that run in production, systematic evaluation is not optional — it's a necessity. The tools are mature, open source, and ready for use.


This article is part of my RAG series. For an introduction to RAG, I recommend Introduction to RAG and the CRAG Architecture. For advanced architectures: Agentic RAG and GraphRAG. And if you want to bring a RAG system to production: RAG in Production.


Need help with quality assurance for your RAG system? Contact me for a no-obligation consultation on evaluation and testing.