Large Language Models (LLMs) have made significant progress in natural language processing (NLP) tasks, particularly in question answering (QA). Despite this progress, LLMs still struggle with hallucination - generating answers that lack factual accuracy or grounding.
Retrieval-Augmented Generation (RAG) is a promising solution to address LLMs' knowledge limitations. RAG systems search external sources, such as web pages or knowledge graphs, for relevant information before generating answers. This approach helps ground the answers in real-world data, reducing hallucination.
While RAG holds immense potential, it faces various challenges. These include:
Selecting the most relevant information from retrieved sources can be challenging, especially when dealing with a large amount of noisy data.
Retrieving information and generating answers can be slow, which is problematic for real-time applications.
Answering complex questions often requires synthesizing information from multiple sources. Doing this efficiently and accurately is a key challenge.
The paper addresses the need for a comprehensive benchmark to advance RAG research. The authors note that existing QA benchmarks like Natural Questions [12] or TriviaQA [10] don’t sufficiently represent the diverse and dynamic challenges that RAG systems face.
They argue that a good benchmark should be realistic, diverse, and reliable, reflecting the dynamic, noisy conditions under which RAG systems actually operate rather than only static, well-covered questions.
The paper introduces the Comprehensive RAG Benchmark (CRAG), which aims to be a robust and versatile benchmark for testing RAG systems and general QA systems. The benchmark focuses on providing a shared testbed to evaluate how these systems handle real-world, dynamic, and diverse information retrieval and synthesis challenges for reliable LLM-based question answering.
CRAG surpasses existing benchmarks in several respects, including its breadth of domains and question types, its coverage of dynamic and less-popular ("tail") entities, and its realistic retrieval content.
The authors frame the task as follows: given a question and a set of retrieved contents (web search results and, depending on the task, mock knowledge graph results), the system must generate an answer grounded in that content.
The answer should be correct and supported by the retrieved information; when the available evidence is insufficient, the system should acknowledge that it does not know rather than guess.
CRAG defines three tasks for evaluating RAG systems: (1) Retrieval Summarization, where systems answer from a small set of pre-retrieved web pages; (2) KG and Web Retrieval Augmentation, which adds mock knowledge graphs queried through mock APIs; and (3) End-to-End RAG, where systems must select and synthesize from up to 50 retrieved web pages in addition to the mock APIs.
By introducing three tasks, each progressively increasing in complexity, CRAG provides a comprehensive evaluation of RAG systems, testing their capabilities in different areas.
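To make the task setup concrete, here is a minimal sketch of what a single CRAG-style example could look like for the web-retrieval tasks. The field names (`page_name`, `page_snippet`, and so on) are illustrative assumptions, not the benchmark's exact schema.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    """One pre-retrieved web page supplied with a question (field names are assumed)."""
    page_name: str
    page_url: str
    page_snippet: str
    page_html: str  # full HTML of the page, as a web search would return it


@dataclass
class CragExample:
    """A single benchmark item: a question plus the retrieval content a system may use."""
    question: str
    domain: str            # e.g. "finance", "sports", "music", "movie", "open"
    ground_truth: str
    search_results: list[SearchResult] = field(default_factory=list)
    kg_api_available: bool = False  # Tasks 2 and 3 additionally expose mock KG APIs
```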
The CRAG dataset comprises two key parts: a set of question-answer pairs spanning multiple domains and question types, and the accompanying retrieval content (web search results and mock knowledge graphs) against which systems are evaluated.
Domains
Questions span five domains: Finance, Sports, Music, Movie, and Open.
Question Types
CRAG spans eight question types, ranging from simple fact lookups to questions with conditions, comparisons, aggregations, set answers, multi-hop reasoning, heavy post-processing, and false premises.
Dynamism
The paper classifies questions by how frequently their answers change: real-time, fast-changing, slow-changing, and stable.
Entity popularity
The paper samples entities with different popularity levels (head, torso, and tail) to evaluate how well RAG systems handle information about less popular subjects.
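As a rough illustration of how entities might be split into head, torso, and tail groups, the sketch below buckets entities by a popularity signal (for example, page-view counts) and samples from each bucket. The percentile thresholds are assumptions for illustration, not the paper's actual procedure.

```python
import random


def bucket_by_popularity(entity_counts: dict[str, int],
                         head_pct: float = 0.2, torso_pct: float = 0.5) -> dict[str, list[str]]:
    """Split entities into head/torso/tail buckets by a popularity count.

    The cut-offs (top 20% = head, next 30% = torso, remainder = tail) are
    illustrative assumptions; the paper's own sampling may differ.
    """
    ranked = sorted(entity_counts, key=entity_counts.get, reverse=True)
    head_end, torso_end = int(len(ranked) * head_pct), int(len(ranked) * torso_pct)
    return {"head": ranked[:head_end],
            "torso": ranked[head_end:torso_end],
            "tail": ranked[torso_end:]}


def sample_entities(buckets: dict[str, list[str]], per_bucket: int = 5) -> dict[str, list[str]]:
    """Draw the same number of question entities from each popularity bucket."""
    return {name: random.sample(ents, min(per_bucket, len(ents)))
            for name, ents in buckets.items()}
```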
Data generation
Web search results
For each question, the authors used the Brave Search API to retrieve up to 50 HTML pages. This simulates real-world web search results.
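A minimal sketch of this retrieval step is shown below. The Brave Search endpoint, header name, and `count` parameter reflect the public API as commonly documented, but treat them (and the `BRAVE_API_KEY` environment variable) as assumptions rather than the authors' exact pipeline; the API may also cap results per request, so fetching 50 pages can require pagination.

```python
import os

import requests

BRAVE_API_KEY = os.environ["BRAVE_API_KEY"]  # assumed environment variable holding an API key
SEARCH_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"  # assumed endpoint


def fetch_candidate_pages(question: str, max_pages: int = 50) -> list[dict]:
    """Query a web-search API for a question and download the HTML of each result."""
    resp = requests.get(
        SEARCH_ENDPOINT,
        headers={"X-Subscription-Token": BRAVE_API_KEY, "Accept": "application/json"},
        params={"q": question, "count": min(max_pages, 20)},  # assumed per-request cap; paginate for more
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("web", {}).get("results", [])[:max_pages]

    pages = []
    for r in results:
        try:
            html = requests.get(r["url"], timeout=10).text
        except requests.RequestException:
            html = ""  # keep the entry even if the page itself fails to download
        pages.append({"url": r["url"], "title": r.get("title", ""), "html": html})
    return pages
```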
Mock KGs
The authors created mock KGs containing publicly available data, randomly selected entities, and “hard negative” entities with similar names.
Mock APIs
The authors created mock APIs with pre-defined parameters to support structured searches in the mock KGs.
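The sketch below shows, in deliberately simplified form, how a mock KG with a parameterized mock API and a "hard negative" entity might look. The class, data, and method name (`get_artist_info`) are hypothetical, not the benchmark's actual API surface.

```python
class MockMusicKG:
    """A tiny in-memory 'knowledge graph' holding a real-looking entity and a look-alike distractor."""

    def __init__(self):
        # Hypothetical records: the second entry is a "hard negative" with a very similar name.
        self._artists = {
            "The Blue Notes": {"formed": 1990, "genre": "jazz", "members": 5},
            "The Blue Note":  {"formed": 2011, "genre": "rock", "members": 3},
        }

    def get_artist_info(self, name: str) -> dict | None:
        """Mock API call with a single pre-defined parameter: an exact artist name."""
        return self._artists.get(name)


kg = MockMusicKG()
print(kg.get_artist_info("The Blue Notes"))  # intended entity
print(kg.get_artist_info("The Blue Note"))   # similarly named distractor a RAG system must not confuse
```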
The paper proposes two main metrics for evaluating RAG systems:
Score_h
This metric assigns a score to each answer according to four categories: perfect (1), acceptable (0.5), missing (0), and incorrect (-1).
The system's final score is the average Score_h across all questions in the evaluation set. This metric penalizes hallucinations and prioritizes missing answers over incorrect ones.
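A minimal sketch of computing Score_h, assuming the per-category scores described above (perfect = 1, acceptable = 0.5, missing = 0, incorrect = -1):

```python
# Assumed per-category scores, matching the four categories described above.
SCORE_H = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def score_h(labels: list[str]) -> float:
    """Average the per-answer scores over the whole evaluation set."""
    return sum(SCORE_H[label] for label in labels) / len(labels)


# Example: two perfect answers, one missing, one hallucinated.
print(score_h(["perfect", "perfect", "missing", "incorrect"]))  # (1 + 1 + 0 - 1) / 4 = 0.25
```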
The paper employs both human evaluation and model-based automatic evaluation:
Human evaluation
Manual grading to judge each answer as perfect, acceptable, missing, or incorrect.
Automatic evaluation:
A two-step method: rule-based matching against the ground-truth answer first, followed by LLM-based judgment for answers that cannot be matched exactly.
Score_a
This metric uses a three-way scoring system: accurate answers score 1, missing answers 0, and incorrect (hallucinated) answers -1.
The final Score_a is the average score across all questions in the evaluation set. This metric is equivalent to accuracy minus the hallucination rate.
The paper reports accuracy, hallucination, and missing rates, along with Score_a, for each RAG system, using both ChatGPT and Llama 3 as automatic evaluators.
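A hedged sketch of this evaluation pipeline: a cheap rule-based check against the ground truth first, then an LLM judge for everything else, with Score_a computed as the average of the three-way scores (which equals accuracy minus the hallucination rate). The `llm_judge` callable is a placeholder for a call to an evaluator model such as ChatGPT or Llama 3, and the handling of "I don't know" style responses is an assumption.

```python
from typing import Callable

# Three-way scores: accurate = 1, missing = 0, incorrect (hallucinated) = -1.
SCORE_A = {"accurate": 1.0, "missing": 0.0, "incorrect": -1.0}


def grade_answer(answer: str, ground_truth: str,
                 llm_judge: Callable[[str, str], str]) -> str:
    """Two-step grading: rule-based checks first, then an LLM judge for the rest."""
    normalized = answer.strip().lower()
    if normalized in {"", "i don't know", "i do not know"}:  # assumed 'missing' convention
        return "missing"
    if normalized == ground_truth.strip().lower():
        return "accurate"  # exact match needs no LLM call
    # Otherwise defer to the LLM judge, which should return "accurate" or "incorrect".
    return llm_judge(answer, ground_truth)


def score_a(labels: list[str]) -> float:
    """Average three-way score over the evaluation set; equals accuracy minus hallucination rate."""
    return sum(SCORE_A[label] for label in labels) / len(labels)
```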
The authors evaluated two types of systems on CRAG: straightforward RAG solutions built directly on top of LLMs (compared against LLM-only baselines), and state-of-the-art (SOTA) industry RAG systems.
Experiment setup
The straightforward RAG solutions prompt an LLM with the question together with retrieved web content, while the LLM-only baselines answer from the model's parametric knowledge alone.
Evaluations
The authors evaluated the performance of LLM-only solutions and RAG solutions across different dimensions, including domain, dynamism, entity popularity, and question type.
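A small sketch of this kind of sliced evaluation, grouping per-question scores by a chosen dimension; the record fields shown are illustrative assumptions about how results might be stored.

```python
from collections import defaultdict


def sliced_scores(records: list[dict], dimension: str) -> dict[str, float]:
    """Average a per-question 'score' field within each value of a slicing dimension.

    Each record is assumed to look like:
    {"domain": "finance", "dynamism": "fast-changing", "popularity": "tail",
     "question_type": "comparison", "score": -1.0}
    """
    buckets: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        buckets[rec[dimension]].append(rec["score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Example usage: average Score_a per domain, then per dynamism level.
# sliced_scores(results, "domain"); sliced_scores(results, "dynamism")
```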
Performance
Even the most capable LLM-only baselines reach only around 34% accuracy on CRAG, and adding retrieval in a straightforward manner improves accuracy to roughly 44%.
Experiment setup
The SOTA industry RAG systems were evaluated end to end on the same question set.
Evaluations
The authors evaluated the performance of SOTA systems across different dimensions, including domain, dynamism, entity popularity, and question type.
Performance
Even the best industry solutions answer only about 63% of questions without hallucination, and accuracy drops markedly on questions involving rapidly changing facts, less popular (tail) entities, and higher complexity.
CRAG has significant implications for businesses using LLMs for question answering and knowledge-based applications. Here are some key aspects:
The benchmark helps businesses develop and evaluate RAG systems that are more reliable and accurate, leading to increased user trust.
By improving the accuracy and efficiency of RAG systems, businesses can provide better customer experiences, particularly in applications like customer support chatbots or search engines.
CRAG guides businesses in identifying specific areas for research and development to address the limitations of current RAG solutions, ultimately leading to more sophisticated and effective systems.
The benchmark helps businesses make data-driven decisions about the best RAG solutions for their needs, considering factors like accuracy, latency, and cost.
Businesses that leverage CRAG to develop and optimize their RAG systems can gain a competitive advantage by providing more accurate and reliable knowledge-based services to their customers.
CRAG is a significant step forward in benchmarking RAG systems. It offers several advantages over existing benchmarks:
Realistic and diverse scenarios
CRAG simulates real-world scenarios, including web search, knowledge graph search, and dynamic question types.
Comprehensiveness
CRAG covers a wide range of domains and question types, providing a more thorough evaluation of RAG systems.
Insights into system limitations
CRAG identifies specific areas where current RAG solutions struggle, highlighting research directions for improvement.
CRAG provides a valuable tool for research and development in RAG, helping researchers and practitioners to create more robust and reliable systems for knowledge-based applications. The paper highlights the ongoing challenge of ensuring that RAG systems effectively handle noisy information retrieved from external sources, a critical aspect for trustworthy AI.
CRAG captures the following elements that are often missing in existing benchmarks:
Existing benchmarks often focus on static, single-hop questions from a limited set of domains. CRAG simulates real-world scenarios with dynamic questions, diverse domains, and noisy data, making it more challenging and relevant for evaluating RAG systems.
Most benchmarks lack questions about real-time or rapidly changing information. CRAG incorporates questions with varying levels of dynamism, enabling evaluation of systems that can handle information that changes over time.
Benchmarks often focus on questions about popular entities. CRAG includes questions about less popular ("tail") entities, highlighting the need for RAG systems to effectively retrieve and synthesize information about lesser-known subjects.
Existing benchmarks often don’t provide a realistic simulation of knowledge graph search. CRAG's mock APIs enable evaluation of RAG systems' ability to query structured knowledge bases and integrate them with web search results.
By incorporating these elements, CRAG provides a more accurate and comprehensive picture of RAG systems' capabilities, enabling more effective research and development efforts. This comprehensive approach helps ensure that RAG systems can meet the demands of real-world applications, where information is often dynamic, diverse, and complex.