RAG Evaluation: Metrics and Testing for AI Systems

You've built a RAG system and the first answers look promising. But how do you know whether it really works well? "Looks good" isn't enough when your system answers customer inquiries in production or delivers medical information.
In my work with RAG systems in production, I've learned: without systematic evaluation, you're flying blind. This article shows you how to test RAG systems with proven frameworks and metrics, from initial evaluation to CI/CD integration.
Why RAG Evaluation Is Different
Traditional software tests check deterministic outputs: assert calculate_tax(100) == 19. With RAG systems, this is fundamentally different:
- Non-deterministic outputs: The same query can produce different but equally correct answers
- Two sources of error: Problems can lie in retrieval or in generation
- Semantic correctness: "Berlin is the capital" and "The capital of Germany is Berlin" are both correct. A string comparison fails here
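The string-comparison problem takes only a few lines to demonstrate. The `token_overlap` helper below is a deliberately crude stand-in for semantic similarity (real evaluators use embeddings or an LLM judge, not word overlap):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets -- a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

answer_1 = "Berlin is the capital"
answer_2 = "The capital of Germany is Berlin"

print(answer_1 == answer_2)                     # False: exact match rejects a correct answer
print(token_overlap(answer_1, answer_2) > 0.5)  # True: overlap at least gives partial credit
```

Exact comparison rejects a perfectly correct paraphrase; any useful RAG metric has to work at the level of meaning instead.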
The RAG Triad
TruLens developed an elegant model for RAG quality: the RAG Triad. It checks three dimensions that together cover all sources of error:
| Dimension | Checks | Failure Case |
|---|---|---|
| Context Relevance | Are the retrieved chunks relevant to the question? | Irrelevant context → hallucination risk |
| Groundedness | Is the answer based on the retrieved context? | Fabricated facts, not supported by sources |
| Answer Relevance | Does the answer actually address the question? | Correct info, but off-topic |
When all three dimensions score well, your RAG system is demonstrably free of hallucinations, within the limits of your knowledge base.
The Four Core Metrics
The open-source frameworks RAGAS and DeepEval have established themselves as the standard. Both work with four core metrics:
Faithfulness
What is measured? How factually consistent is the generated answer with the retrieved context?
How does it work?
- The answer is broken down into individual claims
- Each claim is checked against the context (NLI-based)
- The score is the ratio of supported claims to total claims
Faithfulness = Supported Claims / Total Claims
Example:
- Context: "Einstein was born on March 14, 1879 in Germany"
- Answer A: "Einstein was born on March 14, 1879 in Germany" → 1.0 (all claims supported)
- Answer B: "Einstein was born on March 20, 1879 in Germany" → 0.5 (date not supported)
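In frameworks like RAGAS, the claim extraction and verification steps are handled by an LLM; the final score is just the ratio above. A minimal sketch using the Einstein example (the claim decomposition is illustrative):

```python
def faithfulness_score(claim_checks: list[bool]) -> float:
    """Ratio of context-supported claims to total claims."""
    return sum(claim_checks) / len(claim_checks)

# Answer B decomposes into two claims; only one is supported by the context.
checks = [
    True,   # "Einstein was born in Germany" -- supported
    False,  # "Einstein was born on March 20, 1879" -- contradicts the context
]
print(faithfulness_score(checks))  # 0.5
```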
Answer Relevancy
What is measured? How relevant is the answer to the question asked?
High faithfulness alone is not enough. The answer must also address the actual question. A system could deliver factually accurate but completely irrelevant information.
Example:
- Question: "What happens if the shoes don't fit?"
- Answer A: "We offer a 30-day return policy at no extra cost." → High relevance
- Answer B: "Our shoes come in sizes 36-48." → Low relevance (doesn't answer the question)
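RAGAS measures this by having an LLM re-generate plausible questions from the answer and comparing them to the original question via embedding similarity. The sketch below illustrates only the comparison step, substituting a toy bag-of-words cosine for embeddings; the re-generated questions are invented for the example:

```python
import math
from collections import Counter

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (toy stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

question = "what happens if the shoes don't fit"
# Invented stand-ins for questions an LLM might re-generate from each answer:
from_answer_a = "what is your return policy if shoes don't fit"
from_answer_b = "what sizes do your shoes come in"

print(cosine_bow(question, from_answer_a) > cosine_bow(question, from_answer_b))  # True
```

An answer about returns re-generates a question close to the original; an answer about sizes does not, and its relevancy score drops accordingly.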
Context Precision
What is measured? Are relevant chunks ranked higher than irrelevant ones?
This metric evaluates the quality of your retriever and reranker. An irrelevant chunk at position 1 reduces precision to ~0.5, while at position 2 it barely matters.
Context Precision@K = Σ(Precision@k × v_k) / (number of relevant items in the top K), where v_k ∈ {0, 1} indicates whether the chunk at rank k is relevant.
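The formula translates directly into code. In the sketch below, `relevance` is the list of binary indicators v_k for the top-K retrieved chunks (an illustrative helper, not the RAGAS implementation):

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Sum of Precision@k at each relevant rank, divided by the number of relevant items."""
    total, hits = 0.0, 0
    for k, v in enumerate(relevance, start=1):
        hits += v
        if v:
            total += hits / k  # Precision@k, counted only where v_k = 1
    return total / sum(relevance) if sum(relevance) else 0.0

print(round(context_precision_at_k([1, 1, 0]), 3))  # 1.0   -- irrelevant chunk ranked last
print(round(context_precision_at_k([0, 1, 1]), 3))  # 0.583 -- irrelevant chunk ranked first
```

The two calls show the rank sensitivity described above: the same three chunks score very differently depending on where the irrelevant one lands.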
Context Recall
What is measured? Were all relevant pieces of information retrieved?
Context Recall compares the facts in the expected answer (the ground truth) with the retrieved chunks, so it requires a reference answer.
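As a rough sketch of the idea, simple substring matching can stand in for the LLM-based attribution that real frameworks perform; the facts and chunk here are illustrative:

```python
def context_recall(ground_truth_facts: list[str], retrieved_chunks: list[str]) -> float:
    """Share of ground-truth facts found in the retrieved chunks (toy substring matching)."""
    context = " ".join(retrieved_chunks).lower()
    found = sum(1 for fact in ground_truth_facts if fact.lower() in context)
    return found / len(ground_truth_facts)

facts = ["march 14, 1879", "won the nobel prize in 1921"]
chunks = ["Einstein was born on March 14, 1879 in Germany"]
print(context_recall(facts, chunks))  # 0.5 -- the Nobel Prize fact was not retrieved
```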
| Metric | Evaluates | Requires Ground Truth |
|---|---|---|
| Faithfulness | Generation | No |
| Answer Relevancy | Generation | No |
| Context Precision | Retrieval | Yes |
| Context Recall | Retrieval | Yes |
RAGAS in Practice
RAGAS (Retrieval Augmented Generation Assessment) is the most widely used open-source framework for RAG evaluation. Installation:
```bash
pip install ragas langchain-openai langchain-community
```
First Evaluation Example
```python
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Configure the evaluator LLM and embeddings
evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
evaluator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)

# Create test data
samples = [
    SingleTurnSample(
        user_input="What is Retrieval-Augmented Generation?",
        retrieved_contexts=[
            "RAG combines information retrieval with text generation. "
            "Relevant documents are retrieved from a knowledge base "
            "and passed to the LLM as context.",
            "RAG was introduced by Meta AI in 2020 and has established "
            "itself as the standard for knowledge-grounded AI systems.",
        ],
        response=(
            "RAG (Retrieval-Augmented Generation) is a technique that "
            "combines information retrieval with text generation. "
            "Relevant documents are retrieved from a knowledge base "
            "and passed to an LLM as context."
        ),
        reference=(
            "RAG combines document retrieval with LLM generation "
            "to deliver answers grounded in external knowledge sources."
        ),
    ),
]
dataset = EvaluationDataset(samples=samples)

# Run the evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=evaluator_llm),
        AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings),
        ContextPrecision(llm=evaluator_llm),
        ContextRecall(llm=evaluator_llm),
    ],
)

# Display the results
df = results.to_pandas()
print(df[[
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
]])
```
Typical output:
| faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|
| 1.00 | 0.92 | 1.00 | 0.85 |
Interpreting the Results
| Score Range | Rating | Action |
|---|---|---|
| 0.8 to 1.0 | Good | Continue monitoring |
| 0.6 to 0.8 | Acceptable | Optimization recommended |
| < 0.6 | Critical | Immediate action required |
Low faithfulness? → Your LLM is hallucinating. Check the prompt template, reduce the temperature, implement guardrails.
Low context precision? → Your retriever is returning irrelevant chunks. Improve embedding quality or add a reranker.
Low context recall? → Relevant information is not being found. Review your chunking strategy and embedding coverage.
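These bands are straightforward to encode as a helper for reports or dashboards; the `rate_score` function and the example scores are illustrative:

```python
def rate_score(score: float) -> str:
    """Map a metric score to the action bands from the table above."""
    if score >= 0.8:
        return "good"
    if score >= 0.6:
        return "acceptable"
    return "critical"

# Hypothetical evaluation results for one run
results = {"faithfulness": 0.91, "answer_relevancy": 0.88,
           "context_precision": 0.72, "context_recall": 0.55}
print({metric: rate_score(score) for metric, score in results.items()})
```

Here context_recall would land in the "critical" band, pointing you at the chunking and embedding-coverage checks described above.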
Generating Synthetic Test Data
Manually writing hundreds of test questions is labor-intensive. RAGAS can generate synthetic test data directly from your documents:
```python
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

# Configure the generator
generator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o")
)
generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
)
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embeddings,
)

# Generate a test set from the documents
testset = generator.generate_with_langchain_docs(
    documents=docs,
    testset_size=50,
)

# Export as CSV for versioning
df = testset.to_pandas()
df.to_csv("tests/golden_dataset.csv", index=False)
print(f"Generated {len(df)} test questions")
```
The generator creates different question types: simple factual questions, multi-hop questions (requiring multiple chunks), and reasoning questions. This produces a well-balanced test set.
DeepEval: pytest for LLMs
While RAGAS excels at ad-hoc evaluation, DeepEval shines when integrating into existing test workflows. It works like pytest, but for LLMs.
```bash
pip install deepeval
```
Defining Test Cases
```python
# tests/test_rag_quality.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
)

# Define metrics with thresholds
faithfulness = FaithfulnessMetric(threshold=0.8)
relevancy = AnswerRelevancyMetric(threshold=0.7)
precision = ContextualPrecisionMetric(threshold=0.65)

# Test data
TEST_CASES = [
    {
        "input": "What is the difference between RAG and fine-tuning?",
        "expected": "RAG retrieves external documents; fine-tuning adjusts model weights.",
    },
    {
        "input": "Which metrics exist for RAG evaluation?",
        "expected": "Faithfulness, Answer Relevancy, Context Precision, and Context Recall.",
    },
]

def get_rag_response(query: str) -> tuple[str, list[str]]:
    """Call your RAG system."""
    # Plug in your actual RAG system here
    from your_rag_system import query_rag
    result = query_rag(query)
    return result["answer"], result["source_chunks"]

@pytest.mark.parametrize("test_data", TEST_CASES)
def test_rag_pipeline(test_data):
    actual_output, retrieval_context = get_rag_response(test_data["input"])
    test_case = LLMTestCase(
        input=test_data["input"],
        actual_output=actual_output,
        retrieval_context=retrieval_context,
        expected_output=test_data["expected"],
    )
    assert_test(test_case, [faithfulness, relevancy, precision])
```
Run with:
```bash
deepeval test run tests/test_rag_quality.py
```
CI/CD Integration
DeepEval integrates seamlessly with GitHub Actions:
```yaml
# .github/workflows/rag-eval.yml
name: RAG Evaluation

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Python Setup
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Dependencies
        run: pip install deepeval -r requirements.txt
      - name: RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/test_rag_quality.py
```
Every push or pull request automatically triggers the RAG evaluation. When metrics fall below the defined thresholds, the build fails, just like failing unit tests.
Building Golden Datasets
The quality of your evaluation depends entirely on your test data. A Golden Dataset is a curated, versioned collection of questions with expected answers and relevant context.
Three-Tier Strategy
| Tier | Source | Effort | Quality |
|---|---|---|---|
| Silver | RAGAS TestsetGenerator | Low | Medium |
| Gold | Validated by domain experts | Medium | High |
| Production | Annotated real user queries | High | Very high |
Recommendation: Start with 50 silver-tier test cases, have them validated by a domain expert, and continuously add annotated real production queries.
Structure and Versioning
```json
[
  {
    "query": "What is the difference between RAG and fine-tuning?",
    "ground_truth": "RAG retrieves external documents and uses them as context. Fine-tuning adjusts the model weights.",
    "relevant_doc_ids": ["doc_042", "doc_117"],
    "question_type": "comparison",
    "difficulty": "medium",
    "annotator": "domain_expert_1",
    "annotation_date": "2026-02-15"
  }
]
```
Version your golden dataset alongside your code in the Git repository. That way, every commit records exactly which test data was used.
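A small validation step in CI helps keep the dataset consistent as it grows. This sketch checks entries against the schema shown above (the `validate_entries` helper and its required-key set are assumptions based on that schema):

```python
REQUIRED_KEYS = {"query", "ground_truth", "relevant_doc_ids",
                 "question_type", "difficulty", "annotator", "annotation_date"}

def validate_entries(entries: list[dict]) -> list[str]:
    """Collect problems found in golden-dataset entries."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        elif not entry["relevant_doc_ids"]:
            problems.append(f"entry {i}: no relevant documents annotated")
    return problems

entry = {
    "query": "What is the difference between RAG and fine-tuning?",
    "ground_truth": "RAG retrieves external documents; fine-tuning adjusts weights.",
    "relevant_doc_ids": ["doc_042"], "question_type": "comparison",
    "difficulty": "medium", "annotator": "domain_expert_1",
    "annotation_date": "2026-02-15",
}
print(validate_entries([entry]))  # [] -- no problems found
```

Wiring this check into the test suite turns schema drift in the golden dataset into a failing build instead of a silent evaluation gap.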
LLM-as-a-Judge
Why not simply use BLEU or ROUGE? These token-overlap metrics correlate poorly with human evaluation. LLM-based evaluation achieves 15 to 20% higher correlation because it understands semantic equivalence.
Custom Evaluators
Sometimes standard metrics aren't enough. For domain-specific requirements, you can build custom evaluators:
```python
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

JUDGE_PROMPT = """Du bist ein Evaluierungsexperte für RAG-Systeme.
Bewerte die folgende Antwort auf einer Skala von 1-5.
Frage: {query}
Kontext: {context}
Antwort: {response}
Kriterien:
- Treue: Ist die Antwort durch den Kontext belegt?
- Relevanz: Beantwortet die Antwort die Frage?
- Vollständigkeit: Sind alle relevanten Informationen enthalten?
Antworte in JSON-Format:
{{"treue": <1-5>, "relevanz": <1-5>, "vollstaendigkeit": <1-5>, "begruendung": "<Text>"}}"""

async def llm_judge(query: str, context: str, response: str) -> dict:
    result = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, context=context, response=response
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    scores = json.loads(result.choices[0].message.content)
    return {
        "treue": scores["treue"] / 5.0,
        "relevanz": scores["relevanz"] / 5.0,
        "vollstaendigkeit": scores["vollstaendigkeit"] / 5.0,
        "begruendung": scores["begruendung"],
    }
```
This custom judge evaluates in German, with domain-specific criteria, and provides a traceable justification.
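To report judge results over a whole test set, the per-sample scores can be averaged per dimension. A small helper (hypothetical, matching the normalized score dictionary the judge returns):

```python
from statistics import mean

def aggregate_judge_scores(per_sample: list[dict]) -> dict[str, float]:
    """Average each numeric judge dimension across all evaluated samples."""
    numeric_keys = [k for k, v in per_sample[0].items() if isinstance(v, (int, float))]
    return {k: mean(sample[k] for sample in per_sample) for k in numeric_keys}

# Hypothetical normalized scores from two judged samples
scores = [
    {"treue": 1.0, "relevanz": 0.8, "vollstaendigkeit": 0.6, "begruendung": "..."},
    {"treue": 0.8, "relevanz": 1.0, "vollstaendigkeit": 0.8, "begruendung": "..."},
]
print({k: round(v, 2) for k, v in aggregate_judge_scores(scores).items()})
```

The textual justifications are skipped in the aggregate but remain available per sample for debugging individual failures.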
RAG Testing in the CI/CD Pipeline
A production-ready evaluation pipeline combines all the building blocks discussed so far:
```python
# tests/conftest.py
import pytest
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
)

# Quality gates: the pipeline fails if scores fall below these
THRESHOLDS = {
    "faithfulness": 0.80,
    "relevancy": 0.70,
    "precision": 0.65,
}

@pytest.fixture(scope="session")
def rag_metrics():
    return [
        FaithfulnessMetric(threshold=THRESHOLDS["faithfulness"]),
        AnswerRelevancyMetric(threshold=THRESHOLDS["relevancy"]),
        ContextualPrecisionMetric(threshold=THRESHOLDS["precision"]),
    ]
```
Recommended starting thresholds:
| Metric | Minimum | Target | Critical |
|---|---|---|---|
| Faithfulness | 0.80 | 0.90 | < 0.70 |
| Answer Relevancy | 0.70 | 0.85 | < 0.60 |
| Context Precision | 0.65 | 0.80 | < 0.50 |
| Context Recall | 0.60 | 0.80 | < 0.45 |
Start conservatively and gradually tighten the thresholds as your system matures.
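A quality gate based on this table takes only a few lines; the `quality_gate` helper and the example scores are illustrative:

```python
# Minimum thresholds from the table above
THRESHOLDS = {"faithfulness": 0.80, "answer_relevancy": 0.70,
              "context_precision": 0.65, "context_recall": 0.60}

def quality_gate(scores: dict[str, float],
                 thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return all metrics whose score falls below the minimum threshold."""
    return [m for m, minimum in thresholds.items() if scores.get(m, 0.0) < minimum]

# Hypothetical run: context precision misses its minimum
run = {"faithfulness": 0.91, "answer_relevancy": 0.74,
       "context_precision": 0.62, "context_recall": 0.71}
print(quality_gate(run))  # ['context_precision']
```

In CI, a non-empty result would fail the build; tightening the gate later is then a one-line change to the threshold table.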
Framework Comparison
Which framework is right for you?
| Framework | Strength | Ideal For |
|---|---|---|
| RAGAS | Research-oriented, flexible metrics | Custom evaluation, research |
| DeepEval | pytest integration, CI/CD | Engineering teams, automation |
| LangSmith | Tracing + debugging | LangChain users, observability |
| Arize Phoenix | Framework-agnostic, OpenTelemetry | Heterogeneous tech stacks |
| TruLens | RAG Triad methodology | Hallucination detection |
My recommendation: RAGAS for initial evaluation, DeepEval for CI/CD. Both are open source and complement each other perfectly.
Conclusion
RAG evaluation is not an optional extra — it is the quality assurance of your AI application. With the right tools and metrics, it becomes as natural as unit tests in traditional software development.
Getting started in 3 steps:
- Understand the metrics: Faithfulness and Answer Relevancy are your most important indicators
- Build a golden dataset: 50 validated test questions are better than 500 unmaintained ones
- Integrate CI/CD: Automatic evaluation with every deployment
The frameworks presented here, especially RAGAS and DeepEval, make getting started easy. Both can be integrated into existing projects with just a few lines of code.
If you're building RAG systems that run in production, systematic evaluation is not optional — it's a necessity. The tools are mature, open source, and ready for use.
This article is part of my RAG series. For an introduction to RAG, I recommend Introduction to RAG and the CRAG Architecture. For advanced architectures: Agentic RAG and GraphRAG. And if you want to bring a RAG system to production: RAG in Production.
Need help with quality assurance for your RAG system? Contact me for a no-obligation consultation on evaluation and testing.