Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Abstract
Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. Accuracy improves significantly with our proposed multi-step retrieval pipeline, reaching 0.66 (a >50% relative improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
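As a rough illustration of what a multi-step retrieval pipeline can look like, here is a minimal sketch of an iterative retrieve-and-read loop. The `llm` and `search` callables and the prompt format are hypothetical and are not taken from the paper; the actual pipeline details are described in the paper itself.

```python
# Minimal sketch of an iterative retrieve-and-read loop for multi-hop questions.
# `llm` and `search` are placeholders for any chat model / retriever.

def multi_step_rag(question: str, llm, search, max_steps: int = 4) -> str:
    """Iteratively ask the model what to look up next, retrieve it,
    and append it to the context until the model can answer."""
    context = []
    for _ in range(max_steps):
        prompt = (
            "Question: " + question + "\n"
            "Known facts:\n" + "\n".join(context) + "\n"
            "If you can answer, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'SEARCH: <query>' for the next fact you need."
        )
        reply = llm(prompt)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        query = reply[len("SEARCH:"):].strip()
        context.extend(search(query, k=3))  # top-k retrieved passages
    # Fall back to answering with whatever has been gathered so far.
    return llm(
        "Question: " + question + "\nKnown facts:\n" + "\n".join(context) +
        "\nGive your best final answer."
    )
```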
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework (2024)
- PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain (2024)
- Improving Retrieval Augmented Language Model with Self-Reasoning (2024)
- Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning (2024)
- Hierarchical Retrieval-Augmented Generation Model with Rethink for Multi-hop Question Answering (2024)
Unbounded context with memory
The benchmark didn't come with an evaluation script, so we first implemented one in optillm: https://github.com/codelion/optillm/blob/main/scripts/eval_frames_benchmark.py
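For reference, a stripped-down evaluation loop in the same spirit might look like the sketch below. The Hugging Face dataset id, split, and column names are assumptions, and the grading here is naive substring matching rather than the logic used in the optillm script.

```python
# Rough sketch of a FRAMES-style evaluation loop (not the actual optillm script).
# Dataset id, split, and column names ("Prompt", "Answer") are assumptions; adjust
# them to whatever the published benchmark actually uses.
from datasets import load_dataset

def evaluate(answer_fn, limit: int | None = None) -> float:
    """Score a question-answering callable with naive substring matching.
    A stricter setup would use an LLM grader instead."""
    rows = load_dataset("google/frames-benchmark", split="test")
    if limit:
        rows = rows.select(range(limit))
    correct = 0
    for row in rows:
        prediction = answer_fn(row["Prompt"])
        if row["Answer"].lower() in prediction.lower():
            correct += 1
    return correct / len(rows)
```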
I had implemented a memory plugin (https://github.com/codelion/optillm/blob/main/optillm/plugins/memory_plugin.py) in optillm that adds short-term memory and unbounded context to LLMs. We used FRAMES to evaluate the memory plugin with Google's Gemma2 model. Gemma2 has a context window of 8,192 tokens, so when Google reported results in the paper they only reported the naive-prompt setting, which doesn't include the text retrieved via RAG.
However, by using the memory plugin in optillm we can make the context of any LLM effectively unbounded. We managed to boost accuracy to 30.1%, versus the 5.1% Google reported in the paper.
We were also able to get almost the same accuracy as Gemini with just gpt-4o-mini using the optillm memory plugin, even though gpt-4o-mini's context window is one-tenth the size of Gemini's.
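Conceptually, such a memory plugin can be thought of as chunking the long retrieved text, embedding the chunks, and recalling only the most relevant ones that fit in the model's window at query time. The sketch below illustrates that idea only; it is not the optillm implementation, and `embed` stands in for any sentence-embedding model.

```python
# Simplified sketch of a short-term memory for unbounded context: split the long
# retrieved text into chunks, embed them, and at query time keep only the most
# relevant chunks that fit in a budget roughly matching the model's window.
import numpy as np

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_memory(documents: list[str], embed) -> tuple[list[str], np.ndarray]:
    """Embed every chunk of every document once, up front."""
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = np.array([embed(c) for c in chunks])
    return chunks, vectors

def recall(query: str, memory, embed, budget_chars: int = 24000) -> str:
    """Return the most relevant chunks that fit in a character budget
    (a crude proxy for an 8k-token context window)."""
    chunks, vectors = memory
    q = np.array(embed(query))
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    selected, used = [], 0
    for i in np.argsort(-scores):  # highest cosine similarity first
        if used + len(chunks[i]) > budget_chars:
            break
        selected.append(chunks[i])
        used += len(chunks[i])
    return "\n\n".join(selected)
```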