Long-form factuality in large language models

AI-generated keywords: Language Models, Factuality, Benchmarking, SAFE, LongForm

AI-generated Key Points

- Introduction of LongFact:
  - A set of 2,280 prompts covering 38 diverse topics
- Proposal of the SAFE method for evaluating long-form factuality:
  - Leverages language-model agents to assess model responses
  - Breaks responses down into individual facts, refines them for self-containment, evaluates their relevance to the prompt, and verifies their accuracy through Google Search queries
- Use of a hyperparameter K to represent a user's preferred response length, with precision and recall combined in F1@K for evaluation
- Empirical results of SAFE:
  - Agrees with human annotations 72% of the time, which the authors describe as superhuman performance
  - Wins 76% of a random sample of 100 disagreement cases
  - Cost comparison: SAFE is more than 20 times cheaper than crowdsourced human annotators
- Benchmarking of thirteen models from four model families on LongFact shows that larger language models generally exhibit better long-form factuality
- Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration

The code for LongFact and SAFE is available at https://github.com/google-deepmind/long-form-factuality.

Authors: Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

arXiv: 2403.18802v1 (cs.CL)

License: CC BY 4.0

Abstract: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can achieve superhuman rating performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.

Submitted to arXiv on 27 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.18802v1

Comprehensive Summary

In this paper, we benchmark long-form factuality in large language models. We introduce LongFact, a set of 2,280 prompts covering 38 diverse topics, and propose SAFE, a method for evaluating long-form factuality. Our approach uses language-model agents to automatically assess a model's response by breaking it down into individual facts, refining them for self-containment, evaluating their relevance to the prompt, and verifying their accuracy through Google Search queries. We also introduce a hyperparameter K that represents a user's preferred response length, and we combine precision and recall into F1@K for evaluation. Our empirical results show that SAFE achieves superhuman performance: it agrees with human annotations 72% of the time and wins 76% of a random sample of 100 disagreement cases. SAFE is also more than 20 times cheaper than crowdsourced human annotators. Furthermore, we benchmark thirteen models from four model families (Gemini, GPT, Claude, PaLM-2) on LongFact and observe that larger language models generally exhibit better long-form factuality. Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration.

However, while our work focuses on factuality in the sense of correctness with respect to world knowledge, it remains uncertain how to reliably measure hallucination in long-form settings. Through our benchmarking efforts, we aim to demonstrate how robust dataset acquisition methods, model evaluation techniques, and metric aggregation can significantly improve our understanding of language model capabilities in long-form scenarios. We hope that this work will inspire further research into both assessing and improving language models operating in long-form domains.

The code for LongFact and SAFE is available at https://github.com/google-deepmind/long-form-factuality for reference and replication purposes.

Layman's Summary

LongFact is a set of many questions on different topics. SAFE is a method that uses computer programs to check whether a model's answers are correct. It splits each answer into separate facts, makes sure each fact makes sense on its own, and checks it with Google Search. The authors use a number called K to describe how long a user wants an answer to be. SAFE agrees with human checkers most of the time, is usually right when it disagrees with them, and costs much less.

Definitions

- LongFact: A set of many questions on different topics.
- SAFE method: A way to check whether answers are correct using computer programs.
- Language-model agents: Computer programs that understand language and can carry out steps such as searching the web.
- Hyperparameter K: A number that represents how long a user wants an answer to be.
- F1@K: A measure combining precision and recall for evaluation.
- Crowdsourced human annotators: People recruited in large numbers to check whether things are right.
- Benchmarking: Comparing different things on the same test to see which is better or worse.

Introduction

Language models have recently gained significant attention in the field of natural language processing due to their impressive performance on tasks such as text generation and question answering. However, it is still unclear how well these models handle long-form factuality, that is, their ability to give accurate and relevant factual information when a prompt calls for a long response. In the research paper "Long-form factuality in large language models," the authors present a comprehensive evaluation of long-form factuality in language models. They introduce a new prompt set called LongFact, consisting of 2,280 prompts covering 38 diverse topics. The authors also propose a method for evaluating long-form factuality called SAFE (Search-Augmented Factuality Evaluator), which uses language-model agents and Google Search queries.

The Need for Benchmarking Long-Form Factuality

While previous studies have evaluated language models' performance on short-form fact-checking tasks, there has been limited research on their abilities in generating accurate and relevant factual information in longer texts. This is crucial because many real-world applications require language models to produce coherent and reliable responses that are not limited to short answers or snippets. Moreover, existing benchmarks for assessing long-form factuality are either small-scale or focus only on specific domains, making it challenging to compare different models' performances comprehensively. Therefore, there is a need for a standardized benchmark that covers diverse topics and evaluates long-form factuality accurately.

The LongFact Dataset

To address the limitations mentioned above, the authors introduce LongFact, a set of 2,280 prompts covering 38 diverse topics. Rather than being collected from existing sources, the prompts were generated with GPT-4. They are fact-seeking questions on open-ended topics, so a good answer requires a long response containing many individual facts. This makes LongFact suited to benchmarking long-form factuality across open domains.

The SAFE Method

The authors propose a method for evaluating long-form factuality called SAFE (Search-Augmented Factuality Evaluator). This approach uses language-model agents to automatically assess a model's response by breaking it down into individual facts, refining them for self-containment, evaluating their relevance to the prompt, and verifying their accuracy through Google Search queries. For each relevant fact, the language model sends search queries in a multi-step reasoning process and then determines whether the fact is supported by the search results. This automates a rating task that would otherwise rely on human annotators.
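The four steps described above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the prompt tags such as `SPLIT`, the label strings, and the `max_queries` parameter are all hypothetical, and `toy_llm` and `toy_search` are hand-written stand-ins for a real language model and a real Google Search call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: a language-model call and a search call.
LLM = Callable[[str], str]
Search = Callable[[str], List[str]]


@dataclass
class RatedFact:
    fact: str
    label: str  # "supported", "not_supported", or "irrelevant"


def safe_rate(prompt: str, response: str, llm: LLM, search: Search,
              max_queries: int = 3) -> List[RatedFact]:
    """Rate every fact in `response` following the four steps in the text."""
    rated: List[RatedFact] = []
    # Step 1: split the response into individual facts (one per line).
    facts = [f.strip() for f in llm(f"SPLIT\n{response}").splitlines() if f.strip()]
    for fact in facts:
        # Step 2: revise the fact so it is self-contained.
        fact = llm(f"REVISE\n{response}\n{fact}").strip()
        # Step 3: skip facts that are not relevant to the prompt.
        if llm(f"RELEVANT\n{prompt}\n{fact}").strip().lower() != "yes":
            rated.append(RatedFact(fact, "irrelevant"))
            continue
        # Step 4: issue several search queries, then ask for a verdict.
        evidence: List[str] = []
        for _ in range(max_queries):
            query = llm(f"QUERY\n{fact}\n" + "\n".join(evidence)).strip()
            evidence.extend(search(query))
        verdict = llm(f"VERDICT\n{fact}\n" + "\n".join(evidence)).strip().lower()
        rated.append(RatedFact(fact, "supported" if verdict == "yes" else "not_supported"))
    return rated


# Toy stand-ins so the sketch runs without any external service.
KB = ["The Eiffel Tower is in Paris", "The Eiffel Tower opened in 1889"]


def toy_search(query: str) -> List[str]:
    return [s for s in KB if "Eiffel" in query]


def toy_llm(message: str) -> str:
    task, _, body = message.partition("\n")
    lines = body.splitlines()
    if task == "SPLIT":
        return "\n".join(s.strip() for s in body.split(".") if s.strip())
    if task == "REVISE":  # resolve the pronoun in the last line (the fact)
        fact = lines[-1]
        return "The Eiffel Tower " + fact[3:] if fact.startswith("It ") else fact
    if task == "RELEVANT":
        return "yes" if "Eiffel" in lines[-1] else "no"
    if task == "QUERY":
        return lines[0]
    if task == "VERDICT":  # supported only if the fact appears in the evidence
        return "yes" if lines[0] in lines[1:] else "no"
    return ""


results = safe_rate(
    "Tell me about the Eiffel Tower.",
    "The Eiffel Tower is in Paris. It opened in 1789. I like cheese.",
    toy_llm,
    toy_search,
)
labels = [r.label for r in results]
```

Passing the model and the search backend in as callables keeps the control flow testable on its own; a real evaluator would replace the toy functions with actual model and search calls and with carefully written prompts.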

Evaluation Metrics

To aggregate the per-fact ratings into a single score, the authors extend the F1 score. Precision is the percentage of facts in a response that are supported. Recall measures the facts a response provides relative to a hyperparameter K, which represents a user's preferred response length. F1@K combines the two, so a response scores well only if it is both accurate and long enough for the user's needs.
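A minimal sketch of how such a metric could be computed is shown below. The summary does not give the exact formula, so several details here are assumptions: recall is taken as the number of supported facts divided by K and capped at 1, the score is defined as 0 when no fact is supported, and the function name `f1_at_k` and its arguments are illustrative.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Harmonic mean of precision and length-relative recall (assumed form)."""
    if k <= 0:
        raise ValueError("k must be a positive number of facts")
    if supported < 0 or not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if supported == 0:
        return 0.0  # assumption: no supported facts means a score of zero
    # Precision: share of the rated facts that are supported.
    precision = supported / (supported + not_supported)
    # Recall: supported facts relative to the preferred length K, capped at 1.
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Example: 40 supported and 10 unsupported facts, with K = 64.
# precision = 0.8, recall = 0.625
score = f1_at_k(40, 10, 64)
```

Under this form, adding more supported facts beyond K no longer raises recall, so a very long response cannot compensate for unsupported facts.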

Empirical Results

The authors first validated SAFE against human raters. On a set of roughly 16,000 individual facts, SAFE agreed with crowdsourced human annotators 72% of the time, and in a random sample of 100 disagreement cases SAFE won 76% of the time. The authors describe this as superhuman rating performance. SAFE is also more than 20 times cheaper than crowdsourced human annotators, which highlights its potential as an efficient alternative for evaluating long-form factuality in large language models. The authors then benchmarked thirteen models from four model families (Gemini, GPT, Claude, PaLM-2) on LongFact and observed that larger language models generally exhibit better long-form factuality.
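To make the agreement figures concrete, the sketch below computes a per-fact agreement rate between two sets of labels and works out the rough scale implied by the reported numbers. The label lists are invented toy data, the function name `agreement_rate` is illustrative, and 16,000 is used as a stand-in for the "~16k" facts mentioned in the abstract.

```python
from typing import Sequence


def agreement_rate(auto_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of facts on which the two raters give the same label."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


# Toy labels, invented for illustration only.
auto = ["supported", "supported", "not_supported", "supported"]
human = ["supported", "not_supported", "not_supported", "supported"]
toy_rate = agreement_rate(auto, human)  # 3 of the 4 labels match

# Rough scale implied by the reported figures (16_000 approximates "~16k").
total_facts = 16_000
approx_disagreements = round(total_facts * (1 - 0.72))
```

Note that the 76% win rate was measured on a sample of 100 of these disagreements, so it is an estimate for the disagreement cases and not a count over all of them.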

Future Research Avenues

The authors acknowledge that there is still room for improvement in language models' long-form factuality and suggest future research avenues. These could include exploring enhanced pretraining/finetuning methods or integrating external tools to improve language models' abilities in generating accurate and relevant factual information. However, while this work focuses on factuality concerning world knowledge correctness, there remains uncertainty in reliably measuring hallucination within long-form settings. This opens up opportunities for further research into evaluating and enhancing language models' capabilities in handling hallucinations.

Conclusion

In conclusion, the paper "Long-form factuality in large language models" presents a comprehensive evaluation of long-form factuality using the LongFact prompt set and the SAFE method. The results show that SAFE achieves superhuman rating performance and is more than 20 times cheaper than human annotators. Furthermore, benchmarking across thirteen models shows that larger language models generally achieve better long-form factuality. Through their work, the authors aim to inspire further research into assessing and improving language models operating in long-form domains. They also emphasize the importance of robust dataset acquisition methods, model evaluation techniques, and metric aggregation for understanding language model capabilities accurately. The code for LongFact and SAFE is publicly available for reference and replication, so that future studies can build on this work.

Acknowledgements

The authors would like to thank all human annotators who contributed to creating the LongFact dataset as well as Google DeepMind's support for this project. They also express their gratitude towards all researchers working towards advancing natural language processing technologies.

Created on 01 Apr. 2024
