June 11, 2024
5 min

CRAG - Comprehensive RAG Benchmark

CRAG, from Meta AI (FAIR), proposes a comprehensive evaluation framework for RAG systems, covering both straightforward solutions and SOTA industry-level systems.

Key Takeaways

  • The paper proposes CRAG - Comprehensive RAG Benchmark - a benchmark dataset and evaluation framework to test the performance of Retrieval-Augmented Generation (RAG) systems.
  • The benchmark highlights the limitations of current RAG systems, indicating a need for further research.
  • CRAG goes beyond existing benchmarks by simulating real-world scenarios, including web and knowledge graph searches, and incorporating dynamic question types.
  • The results show that CRAG provides a more realistic and challenging environment for evaluating RAG systems, identifying gaps in current solutions and guiding future research directions.

Introduction

Large Language Models (LLMs) have made significant progress in natural language processing (NLP) tasks, particularly in question answering (QA). Despite this progress, LLMs still struggle with hallucination - generating answers that lack factual accuracy or grounding.

Retrieval-Augmented Generation (RAG) is a promising solution to LLMs’ knowledge limitations. RAG systems search external sources, like web pages or knowledge graphs, for relevant information before generating answers. This approach helps ground the answers in real-world data, reducing hallucination.

While RAG holds immense potential, it faces various challenges. These include:

Information selection

Selecting the most relevant information from retrieved sources can be challenging, especially when dealing with a large amount of noisy data.

Latency

Retrieving information and generating answers can be slow, which is problematic for real-time applications.

Information synthesis

Answering complex questions often requires synthesizing information from multiple sources. Doing this efficiently and accurately is a key challenge.

The paper addresses the need for a comprehensive benchmark to advance RAG research. The authors note that existing QA benchmarks like Natural Questions [12] or TriviaQA [10] don’t sufficiently represent the diverse and dynamic challenges that RAG systems face.

They argue that a good benchmark should be:

  • Realistic: It should reflect real-world use cases, ensuring that solutions performing well on the benchmark also perform well in practice.
  • Rich: It should contain a diverse set of instances, including common and complex use cases, to reveal potential limitations of existing solutions.
  • Insightful: It should allow easy understanding of performance on different slices of the data, highlighting strengths and weaknesses of different solutions.
  • Reliable: The benchmark should allow for reliable assessment of metrics. This includes accurate ground truth, metrics that capture model performance, easy evaluation, and statistically significant results.
  • Longevity: The benchmark should be updated regularly to avoid becoming outdated and to ensure continued relevance for research.

The paper introduces the Comprehensive RAG Benchmark (CRAG), which aims to be a robust and versatile benchmark for testing RAG systems and general QA systems. The benchmark focuses on providing a shared testbed to evaluate how these systems handle real-world, dynamic, and diverse information retrieval and synthesis challenges for reliable LLM-based question answering.

CRAG surpasses other benchmarks in several aspects:

  • Comprehensiveness: CRAG includes a large dataset of 4,409 question-answer pairs, covering diverse domains (finance, sports, music, movie, open domain) and question types.
  • Realistic testing: CRAG provides mock APIs simulating web and knowledge graph searches, creating a more realistic testing environment.
  • Dynamic question handling: CRAG incorporates questions with varying temporal dynamism, ranging from real-time information to static facts.
  • Diverse fact popularity: CRAG includes questions about popular and less popular ("tail") entities, addressing the challenge of retrieving and understanding information about lesser-known subjects.
  • Beyond Wikipedia: CRAG goes beyond traditional QA benchmarks by incorporating information from a broader range of sources.

Problem Description

The authors frame the task as:

  • Input: A question Q
  • Output: An answer A that is generated by an LLM based on information retrieved from external sources or the model’s internal knowledge.

The answer should:

  • Provide useful information: It should directly answer the question.
  • Avoid hallucinations: It should not contain inaccurate or fabricated information.

CRAG defines three tasks for evaluating RAG systems:

Task 1: Retrieval Summarization

  • Objective: Test the answer generation capabilities of a RAG system.
  • Data provided: Up to five web pages for each question, which may or may not be relevant to the question.

Task 2: KG and Web Retrieval Augmentation

  • Objective: Test the system’s ability to query structured data sources (mock KGs) and synthesize information from multiple sources (web and KG).
  • Data provided: Up to five web pages and access to mock APIs for querying information from mock KGs. The mock KGs contain structured data relevant to the questions.

Task 3: End-to-end RAG

  • Objective: Test the system’s ability to rank and process a larger number of retrieval results, including web pages and knowledge graph information.
  • Data provided: Up to 50 web pages for each question, mock APIs, and mock KGs.

By introducing three tasks, each progressively increasing in complexity, CRAG provides a comprehensive evaluation of RAG systems, testing their capabilities in different areas.
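
To make the task setup concrete, here is a minimal sketch of a Task 1-style pipeline: the system receives a question plus up to five (possibly irrelevant) web pages and must either answer from them or abstain. The CragExample fields, the call_llm stub, and the prompt wording are illustrative assumptions, not the benchmark's official interface.

```python
from dataclasses import dataclass

@dataclass
class CragExample:
    """One Task 1-style instance: a question plus up to five retrieved pages."""
    question: str
    search_results: list[str]  # page snippets; some may be irrelevant to the question

def call_llm(prompt: str) -> str:
    """Stand-in for an actual LLM call (e.g., Llama 3 Instruct or GPT-4 Turbo)."""
    raise NotImplementedError("plug in your model client here")

def answer_task1(example: CragExample, max_context_chars: int = 8000) -> str:
    # Pack the provided pages into a bounded context, then ask the model to
    # answer only from that context and to abstain when unsure.
    context = "\n\n".join(example.search_results)[:max_context_chars]
    prompt = (
        "Answer the question using only the references below. "
        "If the references do not contain the answer, reply exactly \"I don't know\".\n\n"
        f"References:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    )
    return call_llm(prompt)
```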

Dataset Description

The CRAG dataset comprises two key parts:

Question answering pairs

Domains

Finance, Sports, Music, Movie, and Open domain.

Question Types

  • Simple: Asking for simple facts that are unlikely to change over time, such as the birth date of a person.
  • Simple with Condition: Asking for simple facts with specific conditions, such as the stock price on a certain date.
  • Set: Expecting a set of entities as the answer (e.g., “What are the continents in the Southern Hemisphere?”).
  • Comparison: Comparing two entities (e.g., “Who started performing earlier, Adele or Ed Sheeran?”).
  • Aggregation: Requiring aggregation of retrieved results (e.g., “How many Oscar awards did Meryl Streep win?”).
  • Multi-hop: Requiring multiple pieces of information to compose the answer (e.g., “Who acted in Ang Lee’s latest movie?”).
  • Post-processing heavy: Needing reasoning or processing of the retrieved information (e.g., “How many days did Thurgood Marshall serve as a Supreme Court Justice?”).
  • False Premise: Questions based on a false premise or assumption (e.g., “What’s the name of Taylor Swift’s rap album before she transitioned to pop?”; Taylor Swift has not released a rap album).

Dynamism

The paper classifies questions based on the frequency with which their answers change:

  • Real-time: Answer changes over seconds (e.g., “What’s Costco’s stock price today?”).
  • Fast-changing: Answer changes daily or less frequently (e.g., “When is the Lakers’ game tonight?”).
  • Slow-changing: Answer changes yearly or less frequently (e.g., “Who won the Grammy award last year?”).
  • Static: Answer never changes (e.g., “What is the birth date of a person?”).

Entity popularity

The paper samples entities with different popularity levels (head, torso, and tail) to evaluate how well RAG systems handle information about less popular subjects.

Data generation

  • KG-based questions: The authors collected entities from publicly available KGs and used question templates to generate QA pairs (a minimal sketch of this templating step follows this list).
  • Web-based questions: Annotators were asked to write down possible questions that users might ask based on web search results.
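
For the KG-based questions, the templating step might look like the sketch below; the template wording and the tiny in-memory "KG" are invented for illustration and do not reproduce the paper's actual templates or KG contents.

```python
# Illustrative only: template text and KG facts are assumptions for this sketch.
kg_facts = {
    "Adele": {"birth_date": "1988-05-05"},
    "Ed Sheeran": {"birth_date": "1991-02-17"},
}

templates = {
    "birth_date": "What is the birth date of {entity}?",
}

def generate_qa_pairs(kg: dict, templates: dict) -> list[tuple[str, str]]:
    """Fill question templates with KG entities to produce (question, answer) pairs."""
    pairs = []
    for entity, attributes in kg.items():
        for attribute, template in templates.items():
            if attribute in attributes:
                pairs.append((template.format(entity=entity), attributes[attribute]))
    return pairs

print(generate_qa_pairs(kg_facts, templates))
# [('What is the birth date of Adele?', '1988-05-05'),
#  ('What is the birth date of Ed Sheeran?', '1991-02-17')]
```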

Contents for retrieval

Web search results

For each question, the authors used the Brave Search API to retrieve up to 50 HTML pages. This simulates real-world web search results.

Mock KGs

The authors created mock KGs containing publicly available data, randomly selected entities, and “hard negative” entities with similar names.

Mock APIs

The authors created mock APIs with pre-defined parameters to support structured searches in the mock KGs.
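
As an illustration of the idea (not the benchmark's actual API), a mock KG endpoint might look like the hypothetical sketch below; the function names, parameters, and stored facts are assumptions made for this example.

```python
# Hypothetical mock KG API: names, parameters, and data are illustrative assumptions.
MOCK_KG = {
    "movie": {
        "life of pi": {"director": "Ang Lee", "release_year": 2012},
    },
    "person": {
        "ang lee": {"birth_date": "1954-10-23"},
    },
}

def movie_get_info(title: str) -> dict | None:
    """Structured lookup against the mock movie KG, keyed by lowercased title."""
    return MOCK_KG["movie"].get(title.strip().lower())

def person_get_info(name: str) -> dict | None:
    """Structured lookup against the mock person KG, keyed by lowercased name."""
    return MOCK_KG["person"].get(name.strip().lower())

print(movie_get_info("Life of Pi"))  # {'director': 'Ang Lee', 'release_year': 2012}
print(person_get_info("Ang Lee"))    # {'birth_date': '1954-10-23'}
```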

Metrics and Evaluation

The paper proposes two main metrics for evaluating RAG systems:

Metrics

Score_h

This metric assigns scores to each answer based on four categories:

  • Perfect: 1
  • Acceptable: 0.5
  • Missing: 0
  • Incorrect: -1

The system's final score is the average Score_h across all questions in the evaluation set. This metric penalizes hallucinations: a missing answer is scored better than an incorrect one.
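
Read literally, Score_h is just the mean of these per-question values; a minimal sketch, assuming grade labels matching the four categories above:

```python
# Minimal sketch of Score_h: map human grades to values and average them.
GRADE_VALUES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def score_h(grades: list[str]) -> float:
    """Average graded score over all questions in the evaluation set."""
    return sum(GRADE_VALUES[g] for g in grades) / len(grades)

print(score_h(["perfect", "incorrect", "missing", "acceptable"]))  # 0.125
```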

Evaluation

The paper employs both human evaluation and model-based automatic evaluation:

Human evaluation

Manual grading to judge each answer as perfect, acceptable, missing, or incorrect.

Automatic evaluation

Two-step method (a code sketch follows the steps below):

  • Step 1: If the answer matches the ground truth exactly, it's considered accurate.
  • Step 2: Otherwise, two LLMs (ChatGPT and Llama 3) are used to determine whether the answer is accurate, incorrect, or missing. This approach helps avoid the "self-preference problem" [18], where LLMs might favor their own generations.
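
A minimal sketch of this two-step flow, where the normalization rule and the llm_judge stub are assumptions of the sketch rather than the paper's exact prompts or matching logic:

```python
def normalize(text: str) -> str:
    """Light normalization before exact matching (an assumption of this sketch)."""
    return " ".join(text.lower().split())

def llm_judge(question: str, ground_truth: str, answer: str) -> str:
    """Stand-in for the LLM evaluators (ChatGPT and Llama 3 in the paper).

    Should return one of: 'accurate', 'incorrect', 'missing'.
    """
    raise NotImplementedError("plug in the evaluator model(s) here")

def auto_grade(question: str, ground_truth: str, answer: str) -> str:
    # Step 1: an exact match against the ground truth counts as accurate.
    if normalize(answer) == normalize(ground_truth):
        return "accurate"
    # Step 2: otherwise defer to LLM evaluators other than the answering model,
    # which mitigates the self-preference problem.
    return llm_judge(question, ground_truth, answer)
```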

Score_a

This metric uses a three-way scoring system:

  • Accurate: 1
  • Incorrect: -1
  • Missing: 0

The final Score_a is the average score across all questions in the evaluation set. This metric is equivalent to the accuracy rate minus the hallucination rate.
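
The equivalence is easy to verify: with +1 for accurate, -1 for incorrect, and 0 for missing, the mean score equals the fraction of accurate answers minus the fraction of incorrect (hallucinated) ones. A small check with made-up grades:

```python
# Minimal sketch: Score_a equals the accuracy rate minus the hallucination rate.
AUTO_VALUES = {"accurate": 1.0, "incorrect": -1.0, "missing": 0.0}

def score_a(grades: list[str]) -> float:
    return sum(AUTO_VALUES[g] for g in grades) / len(grades)

grades = ["accurate", "accurate", "incorrect", "missing"]
accuracy = grades.count("accurate") / len(grades)        # 0.50
hallucination = grades.count("incorrect") / len(grades)  # 0.25
assert abs(score_a(grades) - (accuracy - hallucination)) < 1e-9  # 0.25 == 0.25
```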

The paper reports the accuracy, hallucination, and missing rates, along with Score_a, for each RAG system, using both ChatGPT and Llama 3 as evaluators.

  • Test data split: The dataset is randomly split into validation, public test, and private test sets at 30%, 30%, and 40%, respectively. The validation and public test sets were released for the KDD Cup Challenge.
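
The paper does not spell out the splitting procedure beyond these proportions; a simple sketch of a random 30/30/40 split might look like this:

```python
import random

def split_dataset(examples: list, seed: int = 0) -> tuple[list, list, list]:
    """Random 30% / 30% / 40% split into validation, public test, and private test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a, b = int(0.3 * n), int(0.6 * n)
    return shuffled[:a], shuffled[a:b], shuffled[b:]
```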

Benchmarking

The authors evaluated two types of RAG systems on CRAG:

Straightforward RAG solutions

Experiment setup

  • The authors ran LLM-only solutions using simple prompts to encourage brief answers and "I don’t know" responses when the confidence was low.
  • They employed Llama 2 Chat, Llama 3 Instruct, and GPT-4 Turbo.
  • For web-only RAG solutions (Task 1), a fixed-length context window (2K tokens for Llama 2 Chat and 4K for Llama 3 Instruct and GPT-4 Turbo) was used.
  • For KG-based solutions (Tasks 2 and 3), a fixed-length KG context window (1K tokens for Llama 2 Chat and 2K for Llama 3 Instruct and GPT-4 Turbo) was used to include the results from the mock APIs.
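
The fixed-length context windows above amount to truncating the retrieved material to a token budget before prompting. A rough sketch, using whitespace tokens as a stand-in for the models' real tokenizers and budgets (e.g., 2K or 4K tokens):

```python
def truncate_to_budget(snippets: list[str], max_tokens: int) -> str:
    """Pack retrieved snippets into a fixed-length context window.

    Whitespace tokens approximate model tokens here; a real setup would use the
    model's own tokenizer and its actual context budget.
    """
    packed, used = [], 0
    for snippet in snippets:
        tokens = snippet.split()
        if used + len(tokens) > max_tokens:
            tokens = tokens[: max_tokens - used]
        packed.append(" ".join(tokens))
        used += len(tokens)
        if used >= max_tokens:
            break
    return "\n\n".join(packed)
```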

Evaluations

The authors evaluated the performance of LLM-only solutions and RAG solutions across different dimensions, including domain, dynamism, entity popularity, and question type.

Performance

  • LLM-only: GPT-4 Turbo achieved an accuracy of only 34%.
  • Straightforward RAG: Solutions obtained up to 44% accuracy. However, the scores remained low (below 20%) because of hallucinations introduced by irrelevant retrieval results.
  • Task 2 vs. Task 1: Task 2 scores were higher than those for Task 1, suggesting that KG knowledge helps improve accuracy at a similar or lower hallucination rate.
  • Task 3 vs. Task 2: Task 3 scores were higher than those for Task 2, highlighting the importance of search ranking in RAG.

State-of-the-art industry solutions

Experiment setup

  • The authors evaluated four industry SOTA RAG systems built upon LLMs and search engines: Copilot Pro, Gemini Advanced, ChatGPT Plus, and Perplexity.ai.
  • They collected responses for human grading.
  • Traffic weights were applied to the questions to reflect real-world user interaction data.

Evaluations

The authors evaluated the performance of SOTA systems across different dimensions, including domain, dynamism, entity popularity, and question type.

Performance

  • SOTA solutions: The best system achieved a score of 51%, showing significant improvement over straightforward RAG solutions. However, the hallucination rate remained high (17% to 25%), indicating the need for further research.
  • Performance across dimensions: The results confirmed that the most difficult slices of the benchmark for straightforward solutions (real-time and fast-changing queries, questions about torso and tail entities, multi-hop reasoning) remained challenging for SOTA solutions.
  • Latency: Latency varied significantly across systems, ranging from 2.5 seconds to 11.6 seconds. This reflects trade-offs between latency and quality in different system designs.

Business implications

CRAG has significant implications for businesses using LLMs for question answering and knowledge-based applications. Here are some key aspects:

Enhanced trust

The benchmark helps businesses develop and evaluate RAG systems that are more reliable and accurate, leading to increased user trust.

Improved customer experience

By improving the accuracy and efficiency of RAG systems, businesses can provide better customer experiences, particularly in applications like customer support chatbots or search engines.

Targeted research and development

CRAG guides businesses in identifying specific areas for research and development to address the limitations of current RAG solutions, ultimately leading to more sophisticated and effective systems.

Data-driven decision-making

The benchmark helps businesses make data-driven decisions about the best RAG solutions for their needs, considering factors like accuracy, latency, and cost.

Competitive advantage

Businesses that leverage CRAG to develop and optimize their RAG systems can gain a competitive advantage by providing more accurate and reliable knowledge-based services to their customers.

Conclusion

CRAG is a significant step forward in benchmarking RAG systems. It offers several advantages over existing benchmarks:

Realistic and diverse scenarios

CRAG simulates real-world scenarios, including web search, knowledge graph search, and dynamic question types.

Comprehensiveness

CRAG covers a wide range of domains and question types, providing a more thorough evaluation of RAG systems.

Insights into system limitations

CRAG identifies specific areas where current RAG solutions struggle, highlighting research directions for improvement.

CRAG provides a valuable tool for research and development in RAG, helping researchers and practitioners to create more robust and reliable systems for knowledge-based applications. The paper highlights the ongoing challenge of ensuring that RAG systems effectively handle noisy information retrieved from external sources, a critical aspect for trustworthy AI.

Why is CRAG more effective?

CRAG captures the following elements that are often missing from existing benchmarks:

Real-world complexity

Existing benchmarks often focus on static, single-hop questions from a limited set of domains. CRAG simulates real-world scenarios with dynamic questions, diverse domains, and noisy data, making it more challenging and relevant for evaluating RAG systems.

Dynamic information

Most benchmarks lack questions about real-time or rapidly changing information. CRAG incorporates questions with varying levels of dynamism, enabling evaluation of systems that can handle information that changes over time.

Entity popularity

Benchmarks often focus on questions about popular entities. CRAG includes questions about less popular ("tail") entities, highlighting the need for RAG systems to effectively retrieve and synthesize information about lesser-known subjects.

Mock APIs for knowledge graph search

Existing benchmarks often don’t provide a realistic simulation of knowledge graph search. CRAG's mock APIs enable evaluation of RAG systems' ability to query structured knowledge bases and integrate them with web search results.

By incorporating these elements, CRAG provides a more accurate and comprehensive picture of RAG systems' capabilities, enabling more effective research and development efforts. This comprehensive approach helps ensure that RAG systems can meet the demands of real-world applications, where information is often dynamic, diverse, and complex.


