In this paper, we benchmark long-form factuality in large language models. We introduce LongFact, a set of 2,280 prompts covering 38 topics, and propose SAFE, a method for evaluating long-form factuality. SAFE uses language-model agents to automatically assess a model's response by breaking it down into individual facts, revising each fact to be self-contained, evaluating its relevance to the prompt, and verifying its accuracy through Google Search queries. We also introduce a hyperparameter K that represents the response length a user prefers, and combine precision and recall into F1@K for evaluation. Our empirical results show that SAFE agrees with human annotations 72% of the time and, on a random sample of 100 disagreement cases, wins 76% of them, a level of performance we describe as superhuman. SAFE is also more than 20 times cheaper than crowdsourced human annotators. We benchmark thirteen models from four model families (Gemini, GPT, Claude, PaLM-2) on LongFact and observe that larger language models generally exhibit better long-form factuality. Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration. While our work focuses on factuality with respect to world knowledge, reliably measuring hallucination in long-form settings remains an open problem. Through this benchmark, we aim to show how robust dataset acquisition, model evaluation, and metric aggregation can improve our understanding of language model capabilities in long-form settings. We hope that this work will inspire further research into both assessing and improving language models in long-form domains.
The code for LongFact and SAFE is accessible at https://github.com/google-deepmind/long-form-factuality for reference and replication purposes.
- Introduction of LongFact:
  - A set of 2,280 prompts covering 38 topics
- Proposal of SAFE, a method for evaluating long-form factuality:
  - Uses language-model agents to assess model responses
  - Breaks responses down into individual facts, revises them to be self-contained, evaluates their relevance to the prompt, and verifies their accuracy through Google Search queries
- Hyperparameter K represents the response length a user prefers; precision and recall are combined into F1@K for evaluation
- Empirical results of SAFE:
  - Agrees with human annotations 72% of the time
  - Wins 76% of 100 randomly sampled disagreement cases
  - Is more than 20 times cheaper than crowdsourced human annotators
- Benchmarking thirteen models from four model families on LongFact shows that larger language models generally exhibit better long-form factuality
- Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration
Summary
LongFact is a collection of many questions on different topics. SAFE is a method that uses computer programs to check whether long answers are correct. It splits each answer into separate facts, makes sure each fact makes sense on its own, and checks it with Google Search. A number called K describes how long a person wants an answer to be. SAFE agrees with human checkers most of the time, is usually right when they disagree, and costs less.
Definitions
- LongFact: A collection of many questions on different topics.
- SAFE method: A way to check if answers are correct using computer programs.
- Language-model agents: Computer programs that understand language.
- Hyperparameter K: A number used to decide how long answers should be.
- F1@K: A measure combining precision and recall for evaluation.
- Crowdsourced human annotators: People hired through online platforms to check whether things are right.
- Benchmarking: Comparing different things to see which is better or worse.
Introduction
Language models have recently gained significant attention in natural language processing because of their strong performance on tasks such as text generation and question answering. However, it is still poorly understood how well these models handle long-form factuality, that is, how accurately they state facts in long, multi-sentence responses to a given prompt.
The research paper "Benchmarking Long-Form Factuality in Large Language Models" presents an evaluation of long-form factuality in language models. The authors introduce a new prompt set called LongFact, consisting of 2,280 prompts covering 38 topics. They also propose a method for evaluating long-form factuality known as SAFE (Search-Augmented Factuality Evaluator), which relies on language-model agents and Google Search queries.
The Need for Benchmarking Long-Form Factuality
While previous studies have evaluated language models' performance on short-form fact-checking tasks, there has been limited research on their abilities in generating accurate and relevant factual information in longer texts. This is crucial because many real-world applications require language models to produce coherent and reliable responses that are not limited to short answers or snippets.
Moreover, existing benchmarks for assessing long-form factuality are either small in scale or focused on specific domains, which makes it hard to compare models' performance comprehensively. A standardized benchmark that covers diverse topics and evaluates long-form factuality accurately is therefore needed.
The LongFact Dataset
To address the limitations mentioned above, the authors introduce LongFact, a set of 2,280 prompts covering 38 topics such as history, science, politics, and sports. The prompts were generated with GPT-4 rather than collected by hand.
The prompts are written to require long-form responses that draw on detailed world knowledge, which makes LongFact suitable as a benchmark for evaluating long-form factuality.
The SAFE Method
The authors propose a method for evaluating long-form factuality called SAFE (Search-Augmented Factuality Evaluator). It uses language-model agents to automatically assess a model's response by breaking it down into individual facts, revising each fact to be self-contained, evaluating its relevance to the prompt, and verifying its accuracy through Google Search queries.
This method addresses some limitations of existing evaluation approaches, such as depending on human annotation or considering only surface-level correctness. Because it issues Google Search queries, SAFE can check each generated fact against external evidence and can separate facts that are relevant to the prompt from those that are not.
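The four steps described above can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the function `rate_response`, the `FactVerdict` type, the label strings, and all of the `toy_*` helpers are hypothetical names introduced here. In SAFE each step is carried out by a language model and the last one uses Google Search; the stand-ins below use simple string rules so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    label: str  # "supported", "not_supported", or "irrelevant" (hypothetical labels)


def rate_response(
    prompt: str,
    response: str,
    split_into_facts: Callable[[str], List[str]],
    make_self_contained: Callable[[str, str], str],
    is_relevant: Callable[[str, str], bool],
    is_supported_by_search: Callable[[str], bool],
) -> List[FactVerdict]:
    """Apply the four steps in order: split, revise, check relevance, verify."""
    verdicts = []
    for raw_fact in split_into_facts(response):
        fact = make_self_contained(raw_fact, response)
        if not is_relevant(fact, prompt):
            verdicts.append(FactVerdict(fact, "irrelevant"))
        elif is_supported_by_search(fact):
            verdicts.append(FactVerdict(fact, "supported"))
        else:
            verdicts.append(FactVerdict(fact, "not_supported"))
    return verdicts


# Toy stand-ins (assumptions): a real system would call a language model
# for the first three steps and a search engine for the last one.
KNOWN_FACTS = {"The Eiffel Tower is in Paris"}


def toy_split(response: str) -> List[str]:
    # Treat every sentence as one fact.
    return [s.strip(". ") for s in response.split(". ") if s.strip(". ")]


def toy_self_contained(fact: str, response: str) -> str:
    # Resolve a leading pronoun so the fact can be read in isolation.
    return "The Eiffel Tower " + fact[3:] if fact.startswith("It ") else fact


def toy_relevant(fact: str, prompt: str) -> bool:
    return "Eiffel Tower" in fact


def toy_search(fact: str) -> bool:
    return fact in KNOWN_FACTS


verdicts = rate_response(
    "Tell me about the Eiffel Tower.",
    "The Eiffel Tower is in Paris. It is made of wood. Cheese is tasty.",
    toy_split,
    toy_self_contained,
    toy_relevant,
    toy_search,
)
for v in verdicts:
    print(v.label, "|", v.fact)
```

Passing the four steps in as functions keeps the control flow separate from how each step is implemented, so the string rules could be swapped for model or search calls without changing `rate_response`.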
Evaluation Metrics
To aggregate the per-fact results into a single score, the authors introduce a hyperparameter K that represents the response length a user prefers, expressed as a number of supported facts. They then combine precision and recall into F1@K: precision is the fraction of a response's facts that are supported, and recall measures how many supported facts the response provides relative to K.
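A minimal sketch of how F1@K could be computed from per-fact counts is given below. It assumes the following definition, which is not spelled out in this summary: precision is S / (S + N), where S is the number of supported facts and N the number of not-supported facts; recall is min(S / K, 1); and the score is their harmonic mean, or 0 when no fact is supported. The function name `f1_at_k` and the example counts are illustrative only.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """F1@K from fact counts, under the definition assumed in the text above."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if num_supported == 0:
        return 0.0  # no supported facts: the score is defined as zero
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)  # recall saturates once K facts are supported
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 8 supported facts and 2 not-supported facts.
print(f1_at_k(8, 2, k=10))  # precision 0.8, recall 0.8
print(f1_at_k(8, 2, k=4))   # precision 0.8, recall capped at 1.0
print(f1_at_k(0, 5, k=10))  # no supported facts
```

Capping recall at 1 means that facts beyond the K-th no longer raise recall, so a longer response can only improve its score by staying precise.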
Empirical Results
The authors benchmarked thirteen models from four model families (Gemini, GPT, Claude, PaLM-2) on LongFact and observed that larger language models generally exhibit better long-form factuality. In validating SAFE itself, they found that it agrees with human annotations 72% of the time and, on a random sample of 100 disagreement cases, wins 76% of them, a result the authors describe as superhuman performance.
Additionally, SAFE is more than 20 times cheaper than crowdsourced human annotators. This highlights its potential as an efficient alternative for evaluating long-form factuality in large language models.
Future Research Avenues
The authors acknowledge that there is still room for improvement in language models' long-form factuality and suggest future research avenues. These could include exploring enhanced pretraining/finetuning methods or integrating external tools to improve language models' abilities in generating accurate and relevant factual information.
However, while this work focuses on factuality concerning world knowledge correctness, there remains uncertainty in reliably measuring hallucination within long-form settings. This opens up opportunities for further research into evaluating and enhancing language models' capabilities in handling hallucinations.
Conclusion
In conclusion, the paper "Benchmarking Long-Form Factuality in Large Language Models" presents a comprehensive evaluation of long-form factuality using the LongFact dataset and the SAFE method. The results demonstrate that SAFE achieves superhuman performance and is more cost-effective compared to human annotators. Furthermore, benchmarking on various models highlights the potential for improving long-form factuality through larger language models.
Through their work, the authors aim to inspire further research into assessing and enhancing language models operating in long-form domains. They also emphasize the importance of robust dataset acquisition methods, model evaluation techniques, and metric aggregation in understanding language model capabilities accurately. The code for LongFact and SAFE is publicly accessible for reference and replication purposes, encouraging future studies to build upon this work.
Acknowledgements
The authors would like to thank all human annotators who contributed to creating the LongFact dataset as well as Google DeepMind's support for this project. They also express their gratitude towards all researchers working towards advancing natural language processing technologies.