How we are doing RAG AI evaluation in Atlas

Written by Atlas Chief Architect | Apr 17, 2025

As AI adoption continues to accelerate, organizations are increasingly relying on Large Language Models (LLMs) to power intelligent applications—from chatbots to copilots to knowledge assistants. But building an AI system is just the beginning. One of the most critical, yet often overlooked, aspects of the AI development lifecycle is evaluation: understanding how well your system performs, identifying failure modes, and making informed decisions about models, prompts, and architecture.

This becomes even more crucial when working with Retrieval-Augmented Generation (RAG) systems, where a language model is combined with an external knowledge source to improve relevance and reduce hallucinations. RAG systems introduce new variables—such as document retrieval quality and grounding consistency—that make traditional evaluation methods insufficient.

In this post, we’ll walk through how to evaluate AI systems, particularly RAG-based architectures, using the .NET ecosystem and the Microsoft.Extensions.AI.Evaluation library. We'll cover evaluation criteria, challenges unique to RAG systems, and how to build a structured, repeatable evaluation pipeline to gain deep insights into your AI’s behavior.

What do we mean by "evaluation"?

When we talk about evaluating AI systems, we’re not just referring to precision, recall, or BLEU scores. Those traditional metrics—while still useful in certain contexts—fail to capture the full complexity of interactions with modern LLMs, especially in retrieval-augmented generation (RAG) systems.

Evaluation in this context is about qualitatively assessing how well the AI system meets user expectations. It’s about answering questions like:

  • Does the response actually help the user?
  • Is the information accurate, complete, and well-structured?
  • Could a different model or prompt improve the outcome?
  • Is the retrieval component surfacing the right knowledge to support the answer?

These questions go beyond pure model performance and into the realm of end-to-end system quality.

For example, you might have a great LLM, but if the retrieved documents aren’t relevant, the output will still be poor. Or you might get a factually correct answer that’s hard to read, incoherent, or misleading because of subtle phrasing issues.

That’s why evaluation must be holistic, considering not just what the model outputs, but also how it got there—and whether it makes sense from a user and business perspective.

In practice, this means defining a clear set of evaluation criteria, applying them consistently, and analyzing the results across different configurations, models, or prompts. It’s not just about getting a score—it’s about building confidence in your AI system.

Common evaluation criteria in AI systems

To evaluate LLM-based systems effectively, especially those using RAG, we need to go beyond accuracy and look at qualitative dimensions of the responses.

Here are the key criteria we use:

  • Relevance
    Is the response useful and on-topic given the input question?
    Example: You ask about Azure OpenAI, but the response is about Amazon SageMaker. → Poor relevance.
  • Truthfulness
    Are the facts in the response correct?
    Example: The model gives the wrong release date for .NET 8. → Poor truthfulness.
  • Completeness
    Does the response cover all important aspects of the question?
    Example: A deployment guide that skips key steps. → Poor completeness.
  • Fluency
    Is the language grammatically correct and easy to understand?
    Example: “Model train Azure upload then works.” → Poor fluency.
  • Coherence
    Do the ideas make sense together and logically flow?
    Example: A response that contradicts itself. → Poor coherence.
  • Equivalence
    Do two different responses express the same meaning, even if phrased differently?
    Useful when comparing model variants or evaluating paraphrasing.
  • Groundedness
    Is the response based on verifiable information (e.g., retrieved documents), or is it hallucinating?
    This is especially important in RAG, where grounding is a key goal.

Each of these dimensions offers a different lens for understanding system quality, and many evaluation tools—including the one we’ll explore later—support them natively.
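
The library we dig into later in this post, Microsoft.Extensions.AI.Evaluation, ships LLM-as-judge evaluators for most of these dimensions. As a rough sketch (assuming an IChatClient called chatClient is already wired up to the judge model, and that the preview type names have not shifted since writing), scoring a single question/answer pair looks roughly like this:

using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Assumption: `chatClient` is an IChatClient pointing at the model that acts as the
// judge (for us, an Azure OpenAI deployment). Top-level program with implicit usings;
// type and member names follow the preview packages and may change between versions.
ChatConfiguration chatConfiguration = new(chatClient);

// Relevance, Truthfulness and Completeness come from one composite evaluator;
// Fluency and Coherence each have their own.
IEvaluator[] evaluators =
[
    new RelevanceTruthAndCompletenessEvaluator(),
    new FluencyEvaluator(),
    new CoherenceEvaluator(),
];

var question = new ChatMessage(ChatRole.User, "When was .NET 8 released?");
var answer = new ChatResponse(new ChatMessage(ChatRole.Assistant, ".NET 8 was released in November 2023."));

foreach (IEvaluator evaluator in evaluators)
{
    EvaluationResult result = await evaluator.EvaluateAsync(question, answer, chatConfiguration);

    // Each evaluator contributes one or more named metrics (typically scored 1-5).
    foreach (EvaluationMetric metric in result.Metrics.Values)
    {
        Console.WriteLine($"{metric.Name}: {(metric as NumericMetric)?.Value}");
    }
}

Equivalence and Groundedness need extra context (the expert ground-truth answer and the retrieved documents, respectively), which we come back to when we look at RAG-specific challenges below.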



Evaluation challenges in RAG systems

While evaluation is essential for all AI systems, RAG introduces unique complexities:

  • The dual pipeline problem
    In RAG, you’re not just evaluating a model—you’re evaluating a pipeline. A poor response could be due to the model, the retriever, or both.
  • Hallucinated grounding
    Sometimes the AI fabricates content and falsely attributes it to retrieved documents. Detecting this requires manual review or custom evaluators.
  • Retrieval quality is hard to score
    A retrieved document may look relevant to a search engine but be useless in context. Measuring retrieval contribution remains a gray area.
  • Cost-performance tradeoffs
    Evaluating different configurations (e.g., models, temperatures, prompt styles) is necessary to optimize cost without degrading user experience.
  • Semantic similarity ≠ correctness
    A response might "sound good" while being wrong. That's why groundedness and truthfulness are separate axes in our evaluations.

In short: evaluation in RAG systems must go deeper than “was this a good answer?” It must explore why the answer was good (or bad), and what part of the system is responsible.
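
One practical way to answer the "which part is responsible" question is to score the same answer against two different reference points: the documents the retriever actually returned (Groundedness) and the expert-written ground truth (Equivalence). A hedged sketch, reusing chatConfiguration from the earlier snippet; retrievedChunks and generatedAnswer are placeholders for whatever your pipeline produces, and the context types and parameter names follow the preview API:

using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholders standing in for real pipeline output; `chatConfiguration` is the
// judge configuration from the previous snippet.
string[] retrievedChunks = ["...text of retrieved document 1...", "...text of retrieved document 2..."];
string generatedAnswer = "...the answer produced by the RAG pipeline...";
string groundTruth = "...the expected answer, as written by a domain expert...";

var question = new ChatMessage(ChatRole.User, "How do I publish a page in Atlas?");
var answer = new ChatResponse(new ChatMessage(ChatRole.Assistant, generatedAnswer));

// Groundedness: is the answer actually supported by what the retriever returned?
EvaluationResult groundedness = await new GroundednessEvaluator().EvaluateAsync(
    question, answer, chatConfiguration,
    additionalContext: [new GroundednessEvaluatorContext(string.Join("\n\n", retrievedChunks))]);

// Equivalence: does the answer say the same thing as the expert's ground truth?
EvaluationResult equivalence = await new EquivalenceEvaluator().EvaluateAsync(
    question, answer, chatConfiguration,
    additionalContext: [new EquivalenceEvaluatorContext(groundTruth)]);

A low groundedness score together with a high equivalence score usually means the model answered from its own knowledge rather than from the retrieved documents; low scores on both tend to point at retrieval rather than generation.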

Creating a high-quality evaluation dataset

The quality of your evaluation is only as good as your dataset.

In our case, we manually crafted a dataset of queries and expected answers with the help of domain experts inside the organization. These experts deeply understand the company’s documentation, business logic, and terminology—making them ideal partners for defining what “correct” looks like.

Why manual matters:

  • Experts can define not just answers, but also what a good answer includes.
  • Real-world phrasing ensures the queries reflect what users actually ask.
  • We could label not only ground truth but also mark nuances (e.g., “acceptable but incomplete”).

We also explored (but haven't yet used) tools that generate synthetic evaluation datasets directly from your vector index. These can be a good starting point when manual resources are limited.

However, nothing beats expert judgment when your use case requires precision or has compliance implications.
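
For illustration, each entry in such a dataset conceptually boils down to something like the record below. This is a hypothetical sketch with placeholder property names, not the actual schema our tooling consumes:

// Hypothetical shape of a single ground-truth item; property names are
// placeholders, not the schema the Atlas CLI actually expects.
public sealed record EvaluationItem(
    string Question,            // real-world phrasing, as users actually ask it
    string ExpectedAnswer,      // ground truth written by a domain expert
    string[] MustIncludeFacts,  // what a "complete" answer has to cover
    string? Notes = null);      // nuances, e.g. "acceptable but incomplete"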

Microsoft.Extensions.AI.Evaluation: Overview

Microsoft has introduced a flexible evaluation framework as part of the Microsoft.Extensions.AI package ecosystem, built specifically for .NET developers.

Microsoft.Extensions.AI.Evaluation provides tools to define evaluation scenarios, apply both built-in and custom evaluators, and generate structured reports you can analyze or visualize later.

High-level architecture
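
At a high level the flow is: configure a set of evaluators and a judge model, wrap each dataset item in a scenario run, and persist the results so they can be compared across runs and rolled up into reports. A hedged sketch of that flow, reusing the evaluators and chatConfiguration from the earlier snippets and assuming the preview reporting API (Microsoft.Extensions.AI.Evaluation.Reporting) still looks like this:

using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

// Persist results on disk so runs can be compared over time and turned into reports.
// `evaluators`, `chatConfiguration`, `question` and `answer` come from the earlier
// snippets; the reporting type names follow the preview packages and may differ.
ReportingConfiguration reporting = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",
    evaluators: evaluators,
    chatConfiguration: chatConfiguration,
    executionName: "nightly-rag-eval");

// One scenario run per question in the ground-truth dataset.
await using ScenarioRun scenario = await reporting.CreateScenarioRunAsync("publish-page-question");

// Applies every configured evaluator to this question/answer pair and records the scores.
EvaluationResult result = await scenario.EvaluateAsync(question, answer);

Persisted results like these can then be aggregated into reports; the next section shows how we expose that workflow through our own tooling.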

Atlas evaluation tool

One of the pillars of our product strategy is not only to provide the best Intelligent Knowledge Platform with powerful AI features, but also to ensure they are easy to integrate, validate, and operate in real-world environments. That’s why we created Atlas CLI—an internal command-line tool designed to simplify and automate various operations across our Atlas Fuse platform.

Atlas CLI is envisioned as the go-to interface for developers, QA engineers, and DevOps teams who need to interact programmatically with Atlas. Whether it’s checking the health status of the platform, triggering evaluation runs, or—soon—managing knowledge collections, uploading datasets, or initiating training workflows, this CLI is a key component for streamlined and professional AI operations.

Currently, one of the most powerful commands is evaluate, which enables automatic evaluation of a dataset of questions and expected answers. It compares ground-truth expectations with actual outputs from Atlas Fuse, and generates multiple outputs:

  • A CSV report with detailed scores per question, across multiple dimensions (Equivalence, Groundedness, Relevance, Truthfulness, etc.), perfect for QA teams or integration with BI tools.
  • An interactive HTML report, showing each question, the generated answer, the supporting facts (context), and all scoring dimensions in one place.
  • Aggregated metrics and visual insights (in development), to help assess system-wide performance at a glance.

All of this can be triggered from a single command:

atlas fuse evaluate ./data/ground_truth.json -k <KnowledgeCollectionId> -o ./report -d

Lessons learned from real-world usage

Through applying this framework internally, we’ve learned:

  • Evaluation brings clarity
    We now have a consistent way to talk about system quality beyond "it feels okay".
  • Ground truth is everything
    A well-built dataset makes or breaks the evaluation process. Invest in this early.
  • Built-in criteria help, but custom evaluators are gold
    Tailoring evaluators to your domain (e.g., compliance, tone of voice) unlocks deeper insights; see the sketch after this list.
  • Don't underestimate fluency and coherence
    Even factually correct answers can fail if they confuse or frustrate users.
  • Automate early
    Even partial automation of evaluation helps you scale iteration and A/B testing.
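
To make the point about custom evaluators concrete, here is a deliberately simple, hedged sketch of a domain-specific evaluator. The IEvaluator shape follows the preview documentation and may differ slightly between versions; the "uses our terminology" rule and the term list are purely illustrative stand-ins for whatever your domain actually needs (compliance wording, tone of voice, and so on):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative custom evaluator: checks whether the answer uses our product
// terminology. The interface shape follows the preview docs; the rule itself is a
// placeholder for real domain requirements.
public sealed class TerminologyEvaluator : IEvaluator
{
    private const string MetricName = "Uses Atlas terminology";
    private static readonly string[] s_requiredTerms = ["knowledge collection", "Atlas Fuse"];

    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // No LLM call needed here: a simple deterministic check over the answer text.
        bool usesTerminology = s_requiredTerms.Any(term =>
            modelResponse.Text.Contains(term, StringComparison.OrdinalIgnoreCase));

        var metric = new BooleanMetric(MetricName, usesTerminology);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}

Because it returns a regular EvaluationResult, a check like this can sit alongside the built-in evaluators in the same run and show up in the same reports.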

Trustworthy RAG AI needs evaluation

In the race to build AI-powered apps, it’s tempting to jump from prototype to production. But without rigorous evaluation, you’re flying blind. For systems that use LLMs—and especially RAG—quality is nuanced and multi-dimensional.

By applying structured evaluation with tools like Microsoft.Extensions.AI.Evaluation, and grounding your process in expert-labeled datasets, you can build AI systems that are not only powerful, but trustworthy.

Evaluation isn't a final checkbox—it’s an ongoing process that enables better architecture, better decisions, and ultimately, better user experiences.