RAG Retrieval Evaluation Metrics: Complete Guide to Recall, MRR, and NDCG Calculation

Last updated: January 2026 | Reading time: 15 minutes | Difficulty: Intermediate to Advanced

The Error That Started Everything

Last Tuesday, our production RAG system returned catastrophic results. Users complained that answers felt random—"Why did the system recommend the wrong policy document?" The error log showed nothing. No exceptions, no timeouts. Just silently poor retrieval. After three hours of debugging, I realized our evaluation code was fundamentally broken: we were measuring accuracy on our retrieved results but never actually validating whether the right documents were in the candidate set.

The culprit? We never computed Recall@K properly. Our benchmark claimed 94% accuracy while our production users experienced something closer to 60%. This tutorial will save you from making the same mistake.

Why RAG Retrieval Metrics Matter

In Retrieval-Augmented Generation systems, the quality of your answers is bounded by the quality of your retrieval. You can have the most powerful language model in the world, but if it receives irrelevant context, output quality collapses. This is why measuring retrieval performance is non-negotiable for production RAG systems.

The three fundamental metrics every RAG engineer must understand are:

Recall@K — Did we find all relevant documents?
Mean Reciprocal Rank (MRR) — How early in the ranking did we find the first relevant document?
Normalized Discounted Cumulative Gain (NDCG@K) — How well did we rank documents by relevance?

Understanding Recall@K

Recall measures the fraction of relevant documents that were successfully retrieved within the top-K results. In other words: "Did we find everything we needed?"

Formula:

Recall@K = |{relevant documents} ∩ {retrieved top-K documents}| / |{relevant documents}|

A Recall@5 of 0.8 means that 80% of all relevant documents appeared in your top-5 results. This metric is crucial when missing a document has severe consequences—like medical, legal, or financial retrieval systems.
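To make the formula concrete, here is a minimal sketch on a made-up ranking. The document IDs and the helper name `recall_at_k` are illustrative assumptions, and treating an empty relevant set as an error is just one possible convention.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        # Recall is 0/0 here; this sketch refuses to guess a value.
        raise ValueError("Recall is undefined when there are no relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking for one query; five documents are known to be relevant.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3", "d5"]
relevant = {"d1", "d2", "d3", "d4", "d5"}

recall_at_3 = recall_at_k(retrieved, relevant, 3)  # only d2 is in the top 3 -> 1/5
recall_at_5 = recall_at_k(retrieved, relevant, 5)  # d2, d4, d1 are in the top 5 -> 3/5
print(recall_at_3, recall_at_5)
```

Note that Recall@K can never reach 1.0 when K is smaller than the number of relevant documents, so compare scores only at the same K.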

Understanding MRR (Mean Reciprocal Rank)

MRR evaluates where the first relevant document appears in your ranking. It's particularly useful when you only need one good answer—think of a question-answering system where the top result is what the user sees first.

Formula:

RR (for one query) = 1 / rank_of_first_relevant_document

MRR = (1/N) * Σ RR_i for all N queries

If the first relevant document appears at position 1, RR = 1. At position 3, RR = 0.333. At position 10, RR = 0.1. MRR gives you a single number summarizing first-result quality across all queries.
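The two formulas can be sketched as follows. The queries and document IDs are invented for illustration, and a query with no relevant hit is scored as 0, which is a common convention rather than the only one.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        raise ValueError("MRR needs at least one query")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for four queries.
runs = [
    (["a", "b", "c"], {"a"}),        # first hit at rank 1 -> RR = 1
    (["x", "y", "z"], {"z"}),        # first hit at rank 3 -> RR = 1/3
    (["p", "q", "r"], {"missing"}),  # no hit              -> RR = 0
    (["m", "n"], {"n"}),             # first hit at rank 2 -> RR = 1/2
]

mrr = mean_reciprocal_rank(runs)  # (1 + 1/3 + 0 + 1/2) / 4
print(round(mrr, 4))
```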

Understanding NDCG@K

NDCG handles graded relevance: a "somewhat relevant" document contributes less than a "highly relevant" one. It is widely used for ranking evaluation because the logarithmic discount penalizes placing highly relevant documents deep in the results.

Formula:

DCG@K = Σ (rel_i / log2(i + 1)) for i = 1 to K

IDCG@K = DCG@K calculated with perfectly ranked relevant documents

NDCG@K = DCG@K / IDCG@K

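NDCG@K ranges from 0 to 1 when relevance grades are non-negative, where 1 means the top K results are in ideal order. I've found this metric catches ranking quality issues that Recall misses entirely, because Recall@K ignores the order of documents within the top K.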
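A small worked example may help here. The relevance grades below are made up, and the function names are illustrative; the sketch simply follows the DCG, IDCG, and NDCG formulas above.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances, all_relevances, k):
    """NDCG@K given the grades in retrieved order and all judged grades."""
    ideal = sorted(all_relevances, reverse=True)[:k]
    idcg = dcg(ideal)
    if idcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg(ranked_relevances[:k]) / idcg


# Hypothetical query: the top result is irrelevant, the best document is at rank 2.
ranked = [0.0, 1.0, 0.5]   # grades of the retrieved documents, in rank order
all_grades = [1.0, 0.5]    # every judged grade for this query

score = ndcg_at_k(ranked, all_grades, k=3)          # roughly 0.67
perfect = ndcg_at_k([1.0, 0.5, 0.0], all_grades, k=3)
print(round(score, 4), perfect)
```

Both rankings retrieve the same documents, so Recall@3 is identical; only NDCG separates them.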

Implementation: Computing All Three Metrics

Let's build a complete evaluation pipeline. I'll use the HolySheep AI API for embeddings—it's dramatically cheaper than OpenAI ($0.42/MToken for DeepSeek V3.2 vs GPT-4.1's $8) and provides sub-50ms latency on embeddings.

import numpy as np
from typing import List, Dict, Tuple
import requests

# HolySheep AI Configuration
HOLYSHEEP_API_KEY = "YOUR_HOLYSHEEP_API_KEY"
HOLYSHEEP_BASE_URL = "https://api.holysheep.ai/v1"

class RAGEvaluator:
    """Comprehensive RAG retrieval evaluation suite."""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        })
    
    def get_embedding(self, text: str, model: str = "embedding-3") -> List[float]:
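        """Fetch an embedding from the HolySheep API (the timeout value is an arbitrary choice)."""
        response = self.session.post(
            f"{HOLYSHEEP_BASE_URL}/embeddings",
            json={
                "model": model,
                "input": text
            },
            timeout=30  # Fail instead of hanging indefinitely on a stalled connection
        )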
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]
    
    def compute_recall_at_k(
        self,
        retrieved_ids: List[str],
        relevant_ids: set,
        k: int
    ) -> float:
        """
        Calculate Recall@K.
        
        Args:
            retrieved_ids: Document IDs in ranked order (top to bottom)
            relevant_ids: Set of all relevant document IDs for this query
            k: Number of top results to evaluate
        
        Returns:
            Recall@K score (0.0 to 1.0)
        """
        if not relevant_ids:
            return 0.0
        
        retrieved_k = set(retrieved_ids[:k])
        true_positives = len(retrieved_k & relevant_ids)
        
        return true_positives / len(relevant_ids)
    
    def compute_mrr(self, retrieved_ids: List[str], relevant_ids: set) -> float:
        """
        Calculate the reciprocal rank for a single query.
        Average this value over all queries to obtain MRR.
        Returns 0 if no relevant document is found.
        """
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0
    
    def compute_dcg(self, relevances: List[float]) -> float:
        """Compute DCG for a list of relevance scores."""
        dcg = 0.0
        for i, rel in enumerate(relevances, start=1):
            dcg += rel / np.log2(i + 1)
        return dcg
    
    def compute_ndcg_at_k(
        self,
        retrieved_ids: List[str],
        relevance_map: Dict[str, float],
        k: int
    ) -> float:
        """
        Calculate NDCG@K with graded relevance support.
        
        Args:
            retrieved_ids: Document IDs in ranked order
            relevance_map: Dictionary mapping doc_id to relevance score (0.0 to 1.0+)
            k: Evaluation depth
        
        Returns:
            NDCG@K score (0.0 to 1.0)
        """
        # Get relevance scores for retrieved documents
        retrieved_k = retrieved_ids[:k]
        relevances = [relevance_map.get(doc_id, 0.0) for doc_id in retrieved_k]
        
        # Compute DCG
        dcg = self.compute_dcg(relevances)
        
        # Compute ideal DCG (perfect ranking)
        ideal_relevances = sorted(relevance_map.values(), reverse=True)[:k]
        idcg = self.compute_dcg(ideal_relevances)
        
        # Avoid division by zero
        if idcg == 0:
            return 0.0
        
        return dcg / idcg

# Initialize evaluator
evaluator = RAGEvaluator(HOLYSHEEP_API_KEY)
print("RAG Evaluator initialized successfully with HolySheep API")

Practical Benchmarking Pipeline

Now let's create a complete benchmarking system that evaluates retrieval performance across a dataset. I use this exact setup to evaluate our production systems before deployment.

import json
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class EvaluationResult:
    """Container for evaluation metrics."""
    query: str
    recall_at_1: float
    recall_at_5: float
    recall_at_10: float
    mrr: float
    ndcg_at_5: float
    ndcg_at_10: float

class RetrievalBenchmark:
    """Production-ready retrieval benchmark system."""
    
    def __init__(self, evaluator: RAGEvaluator):
        self.evaluator = evaluator
        self.results: List[EvaluationResult] = []
    
    def cosine_similarity(self, a: List[float], b: List[float]) -> float:
        """Compute cosine similarity between two vectors."""
        dot_product = sum(x * y for x, y in zip(a, b))
        norm_a = np.sqrt(sum(x * x for x in a))
        norm_b = np.sqrt(sum(x * x for x in b))
        return dot_product / (norm_a * norm_b + 1e-8)
    
    def retrieve_documents(
        self,
        query: str,
        corpus: Dict[str, str],
        top_k: int = 10
    ) -> List[Tuple[str, float]]:
        """
        Retrieve relevant documents for a query using semantic similarity.
        Returns list of (doc_id, similarity_score) tuples.
        """
        query_embedding = self.evaluator.get_embedding(query)
        
        similarities = []
        for doc_id, doc_text in corpus.items():
            doc_embedding = self.evaluator.get_embedding(doc_text)
            similarity = self.cosine_similarity(query_embedding, doc_embedding)
            similarities.append((doc_id, similarity))
        
        # Sort by similarity descending
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
    
    def evaluate_single_query(
        self,
        query: str,
        corpus: Dict[str, str],
        relevant_ids: set,
        relevance_map: Dict[str, float]
    ) -> EvaluationResult:
        """Evaluate retrieval for a single query."""
        
        # Retrieve top-10 documents
        retrieved = self.retrieve_documents(query, corpus, top_k=10)
        retrieved_ids = [doc_id for doc_id, _ in retrieved]
        
        return EvaluationResult(
            query=query,
            recall_at_1=self.evaluator.compute_recall_at_k(retrieved_ids, relevant_ids, 1),
            recall_at_5=self.evaluator.compute_recall_at_k(retrieved_ids, relevant_ids, 5),
            recall_at_10=self.evaluator.compute_recall_at_k(retrieved_ids, relevant_ids, 10),
            mrr=self.evaluator.compute_mrr(retrieved_ids, relevant_ids),
            ndcg_at_5=self.evaluator.compute_ndcg_at_k(retrieved_ids, relevance_map, 5),
            ndcg_at_10=self.evaluator.compute_ndcg_at_k(retrieved_ids, relevance_map, 10)
        )
    
    def run_benchmark(
        self,
        test_queries: List[Dict]
    ) -> Dict:
        """
        Run complete benchmark on test dataset.
        
        Expected format for each test query:
        {
            "query": "user question",
            "corpus": {"doc1": "document text", "doc2": "document text"},
            "relevant_ids": ["doc1", "doc2"],
            "relevance_map": {"doc1": 1.0, "doc2": 0.7}
        }
        """
        all_results = []
        
        for test_case in test_queries:
            result = self.evaluate_single_query(
                query=test_case["query"],
                corpus=test_case["corpus"],
                relevant_ids=set(test_case["relevant_ids"]),
                relevance_map=test_case["relevance_map"]
            )
            all_results.append(result)
        
        # Aggregate metrics
        metrics = {
            "total_queries": len(all_results),
            "avg_recall_at_1": np.mean([r.recall_at_1 for r in all_results]),
            "avg_recall_at_5": np.mean([r.recall_at_5 for r in all_results]),
            "avg_recall_at_10": np.mean([r.recall_at_10 for r in all_results]),
            "avg_mrr": np.mean([r.mrr for r in all_results]),
            "avg_ndcg_at_5": np.mean([r.ndcg_at_5 for r in all_results]),
            "avg_ndcg_at_10": np.mean([r.ndcg_at_10 for r in all_results])
        }
        
        self.results = all_results
        return metrics

# Example usage with synthetic dataset
benchmark = RetrievalBenchmark(evaluator)

sample_corpus = {
    "doc_001": "Insurance claim procedure requires form CF-42 and photo ID.",
    "doc_002": "Dental coverage includes preventive care at 100% reimbursement.",
    "doc_003": "Policy renewal deadlines are 30 days before expiration date.",
    "doc_004": "Emergency room visits have a $500 deductible plus 20% coinsurance.",
    "doc_005": "Prescription drug coverage tier system with generic discounts."
}

test_queries = [
    {
        "query": "What documents do I need for insurance claims?",
        "corpus": sample_corpus,
        "relevant_ids": ["doc_001"],
        "relevance_map": {"doc_001": 1.0}
    },
    {
        "query": "How much do emergency visits cost?",
        "corpus": sample_corpus,
        "relevant_ids": ["doc_004"],
        "relevance_map": {"doc_004": 1.0, "doc_001": 0.3}
    }
]

results = benchmark.run_benchmark(test_queries)
print(json.dumps(results, indent=2))
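One thing to watch in `retrieve_documents` above: it requests an embedding for every corpus document on every query, so the number of API calls grows with queries times documents. A possible alternative, sketched here under the assumption that the corpus embeddings have already been fetched once and stacked into a NumPy matrix, is to rank with a single vectorized cosine computation. The function name `rank_by_cosine` and the toy 3-dimensional vectors are illustrative stand-ins, not part of the pipeline above.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_ids, doc_matrix, top_k=10):
    """Rank documents by cosine similarity to the query.

    doc_matrix holds one precomputed embedding per row, aligned with doc_ids.
    """
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_matrix, dtype=float)
    q_norm = np.linalg.norm(q) + 1e-8            # small epsilon avoids division by zero
    d_norms = np.linalg.norm(d, axis=1) + 1e-8
    scores = (d @ q) / (d_norms * q_norm)
    order = np.argsort(-scores)[:top_k]          # indices sorted by descending score
    return [(doc_ids[i], float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings.
doc_ids = ["doc_001", "doc_002", "doc_003"]
doc_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_vec = [1.0, 0.1, 0.0]

ranking = rank_by_cosine(query_vec, doc_ids, doc_matrix, top_k=2)
print([doc_id for doc_id, _ in ranking])
```

The returned list has the same `(doc_id, similarity_score)` shape as `retrieve_documents`, so the metric functions can consume it unchanged.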

Production-Grade Metrics Dashboard

For continuous monitoring, I built a dashboard that tracks these metrics over time. This helped us catch a 15% Recall degradation last month before it impacted users.

import time
from datetime import datetime
import statistics

class MetricsDashboard:
    """Real-time metrics tracking for production RAG systems."""
    
    def __init__(self):
        self.history: List[Dict] = []
        self.window_size = 100  # Rolling window for metrics
    
    def record_run(self, metrics: Dict) -> None:
        """Record a benchmark run with timestamp."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            **metrics
        }
        self.history.append(entry)
        
        # Keep only recent history
        if len(self.history) > self.window_size:
            self.history = self.history[-self.window_size:]
    
    def get_trend(self, metric_name: str) -> Dict:
        """Calculate trend statistics for a metric."""
        values = [h[metric_name] for h in self.history if metric_name in h]
        
        if len(values) < 2:
            return {"error": "Insufficient data for trend analysis"}
        
        return {
            "current": values[-1],
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0,
            "min": min(values),
            "max": max(values),
            "trend": "improving" if values[-1] > values[0] else ("flat" if values[-1] == values[0] else "degrading"),
            "delta_pct": ((values[-1] - values[0]) / values[0] * 100) if values[0] != 0 else 0
        }
    
    def check_thresholds(self, metrics: Dict, thresholds: Dict) -> List[str]:
        """Alert if metrics fall below acceptable thresholds."""
        alerts = []
        
        for metric, threshold in thresholds.items():
            if metric in metrics and metrics[metric] < threshold:
                alerts.append(
                    f"ALERT: {metric} = {metrics[metric]:.4f} < {threshold} (threshold)"
                )
        
        return alerts
    
    def generate_report(self) -> str:
        """Generate human-readable metrics report."""
        if not self.history:
            return "No metrics recorded yet."
        
        report_lines = [
            "=" * 60,
            "RAG RETRIEVAL METRICS REPORT",
            "=" * 60,
            f"Generated: {datetime.now().isoformat()}",
            f"Total runs tracked: {len(self.history)}",
            "",
            "AGGREGATE STATISTICS:",
            "-" * 40
        ]
        
        for metric in ["avg_recall_at_5", "avg_mrr", "avg_ndcg_at_5"]:
            trend = self.get_trend(metric)
            report_lines.append(f"{metric}:")
            if "error" in trend:
                # get_trend needs at least two recorded values for this metric
                report_lines.append(f"  {trend['error']}")
                report_lines.append("")
                continue
            report_lines.append(f"  Current: {trend['current']:.4f}")
            report_lines.append(f"  Mean: {trend['mean']:.4f}")
            report_lines.append(f"  Trend: {trend['trend']} ({trend['delta_pct']:+.2f}%)")
            report_lines.append("")
        
        return "\n".join(report_lines)

# Usage example
dashboard = MetricsDashboard()

# Define production thresholds
thresholds = {
    "avg_recall_at_5": 0.85,
    "avg_mrr": 0.70,
    "avg_ndcg_at_5": 0.75
}

# Record sample runs
for _ in range(5):
    sample_metrics = {
        "avg_recall_at_5": np.random.uniform(0.80, 0.95),
        "avg_mrr": np.random.uniform(0.65, 0.85),
        "avg_ndcg_at_5": np.random.uniform(0.70, 0.90)
    }
    dashboard.record_run(sample_metrics)

# Check for alerts
alerts = dashboard.check_thresholds(dashboard.history[-1], thresholds)
if alerts:
    print("PRODUCTION ALERTS:")
    for alert in alerts:
        print(f"  - {alert}")

print(dashboard.generate_report())
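The trend in `get_trend` compares the latest run with the oldest one in the window, which can hide a sudden drop after a long stable period. One possible complement, sketched here with a hypothetical helper `relative_drop` and made-up numbers, is to compare the latest value against the mean of the runs just before it.

```python
import statistics


def relative_drop(history, metric, baseline_runs=5):
    """Relative drop of the latest value versus the mean of the preceding runs.

    Positive values mean degradation. Returns None when there is too little
    data or the baseline is zero.
    """
    values = [h[metric] for h in history if metric in h]
    if len(values) < 2:
        return None
    # Up to `baseline_runs` values immediately before the latest one.
    baseline = statistics.mean(values[-(baseline_runs + 1):-1])
    if baseline == 0:
        return None
    return (baseline - values[-1]) / baseline


# Made-up history in the same shape as MetricsDashboard.history entries.
history = [{"avg_recall_at_5": v} for v in (0.90, 0.92, 0.88, 0.90, 0.765)]

drop = relative_drop(history, "avg_recall_at_5", baseline_runs=4)
if drop is not None and drop >= 0.10:  # the 10% alert level is an arbitrary example
    print(f"Recall@5 dropped {drop:.0%} versus the recent baseline")
```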

Common Errors and Fixes

Error 1: TypeError - 'NoneType' object is not iterable

Symptom: After upgrading your embedding model, you get TypeError: 'NoneType' object is not iterable when computing similarities during retrieval.

Cause: The HolySheep API returned null for an embedding (likely due to empty input or rate limiting).

# BROKEN CODE
query_embedding = self.evaluator.get_embedding(query)
# If API fails silently, query_embedding might be None

# WRONG: No validation before using
similarities.append((doc_id, self.cosine_similarity(query_embedding, doc_embedding)))

# CORRECTED CODE
def get_embedding_safe(self, text: str, max_retries: int = 3) -> List[float]:
    """Fetch embedding with error handling and retries."""
    for attempt in range(max_retries):
        try:
            embedding = self.get_embedding(text)
            if embedding is None:
                raise ValueError(f"Empty embedding returned for text: {text[:50]}")
            return embedding
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise RuntimeError(f"Failed to get embedding after {max_retries} attempts: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
    
    raise RuntimeError("Unexpected exit from retry loop")

Error 2: ZeroDivisionError when computing Recall

Symptom: ZeroDivisionError: float division by zero when evaluating queries with no relevant documents.

Cause: Your test dataset has queries with empty relevant_ids sets, causing division by zero in the recall calculation.

# BROKEN CODE
def compute_recall_at_k(self, retrieved_ids, relevant_ids, k):
    retrieved_k = set(retrieved_ids[:k])
    return len(retrieved_k & relevant_ids) / len(relevant_ids)  # Crashes if relevant_ids is empty

# CORRECTED CODE
def compute_recall_at_k(self, retrieved_ids, relevant_ids, k):
    """
    Calculate Recall@K with graceful handling of edge cases.
    """
    # Handle empty relevant set
    if not relevant_ids:
        # Recall is 0/0 here, which is mathematically undefined.
        # Return NaN so the query can be excluded when averaging
        # (for example with np.nanmean) instead of silently counting as 0.
        return float('nan')
    
    retrieved_k = set(retrieved_ids[:k])
    true_positives = len(retrieved_k & relevant_ids)
    
    return true_positives / len(relevant_ids)

# Alternative: If you want a defined value
def compute_recall_at_k_safe(self, retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 1.0 if not retrieved_ids[:k] else 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
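If you go with the NaN-returning variant, keep in mind that a plain mean over per-query scores becomes NaN as soon as one query is undefined. A minimal sketch of NaN-aware aggregation, using invented per-query scores:

```python
import math

import numpy as np

# Hypothetical per-query Recall@5 values; the third query had no relevant documents.
per_query_recall = [1.0, 0.5, float("nan"), 0.0]

naive_mean = np.mean(per_query_recall)       # NaN: one undefined query poisons the average
nan_aware_mean = np.nanmean(per_query_recall)  # averages only the three defined queries
n_undefined = sum(math.isnan(x) for x in per_query_recall)

print(f"mean over defined queries: {nan_aware_mean:.3f} ({n_undefined} undefined)")
```

Reporting the count of undefined queries alongside the mean makes it visible when a test set contains many queries without relevance labels.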

Error 3: 401 Unauthorized - API Authentication Failed

Symptom: requests.exceptions.HTTPError: 401 Client Error: Unauthorized when calling the HolySheep API.

Cause: Invalid API key, expired credentials, or using a key from the wrong environment.

# BROKEN CODE
HOLYSHEEP
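One way to rule out the most common key problems before a benchmark run is to load the key from the environment and fail fast if it is missing or still a placeholder. This is only a sketch: the environment variable name `HOLYSHEEP_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something documented by the provider.

```python
import os

PLACEHOLDER = "YOUR_HOLYSHEEP_API_KEY"  # placeholder string used earlier in this article


def load_api_key(env=None, var_name="HOLYSHEEP_API_KEY"):
    """Return the API key from `env`, or raise before any request is sent.

    `var_name` is a hypothetical environment variable name.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()  # stray whitespace is a frequent copy-paste issue
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    if key == PLACEHOLDER:
        raise RuntimeError(f"{var_name} still holds the placeholder value")
    return key


# Demonstration with an explicit mapping instead of the real environment.
example_key = load_api_key({"HOLYSHEEP_API_KEY": "  sk-example  "})
print(len(example_key))
```

A check like this cannot detect an expired or revoked key; a 401 from the server remains the authoritative signal for those cases.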