Long-form factuality in large language models

AI-generated keywords: Language Models, Factuality, Benchmarking, SAFE, LongForm

AI-generated Key Points

  • Introduction of LongFact:
      • A set of 2,280 fact-seeking prompts covering 38 topics
  • Proposal of SAFE (Search-Augmented Factuality Evaluator) for evaluating long-form factuality:
      • Uses language-model agents to assess model responses
      • Breaks a response into individual facts, revises each fact to be self-contained, checks its relevance to the prompt, and verifies its accuracy through Google Search queries
  • F1@K metric: a hyperparameter K represents a user's preferred response length, and F1@K combines precision (the share of supported facts) with recall measured against K
  • Empirical results of SAFE:
      • Agrees with crowdsourced human annotators 72% of the time on a set of ~16k individual facts
      • Wins 76% of a random sample of 100 disagreement cases, which the authors describe as superhuman rating performance
      • Is more than 20 times cheaper than crowdsourced human annotators
  • Benchmarking thirteen models from four model families on LongFact shows that larger language models generally exhibit better long-form factuality
  • Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration
  • The code for LongFact and SAFE is available at https://github.com/google-deepmind/long-form-factuality for reference and replication.

Authors: Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

License: CC BY 4.0

Abstract: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can achieve superhuman rating performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.
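The abstract describes the aggregated metric only in words, so a small sketch may help. It assumes the usual reading of F1@K: precision is the fraction of rated facts that are supported, recall is the number of supported facts divided by K and capped at 1, and the score is 0 when no fact is supported. The function and parameter names below are illustrative and are not taken from the released code.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Sketch of F1@K under the assumptions stated above (names are hypothetical)."""
    if k <= 0:
        raise ValueError("k must be a positive number of facts")
    if num_supported < 0 or num_not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if num_supported == 0:
        # Assumed convention: a response with no supported facts scores 0.
        return 0.0
    # Precision: share of rated facts that are supported.
    precision = num_supported / (num_supported + num_not_supported)
    # Recall: supported facts relative to the preferred length K, capped at 1.
    recall = min(num_supported / k, 1.0)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_at_k(30, 10, 60)` has precision 0.75 and recall 0.5. Under this definition, adding supported facts beyond K does not raise recall further, so very long responses gain nothing from padding.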

Submitted to arXiv on 27 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.18802v1

In this paper, we study how to benchmark long-form factuality in large language models. We introduce LongFact, a set of 2,280 prompts covering 38 topics, and propose SAFE, a method for evaluating long-form factuality. SAFE uses language-model agents to automatically assess a model's response by breaking it down into individual facts, revising each fact to be self-contained, evaluating its relevance to the prompt, and verifying its accuracy through Google Search queries. We also introduce a hyperparameter K representing a user's preferred response length and combine precision and recall into F1@K for evaluation. Empirically, SAFE agrees with crowdsourced human annotations 72% of the time and wins 76% of a random sample of 100 disagreement cases, which we describe as superhuman rating performance. SAFE is also more than 20 times cheaper than crowdsourced human annotators. We benchmark thirteen models from four model families (Gemini, GPT, Claude, and PaLM-2) on LongFact and observe that larger language models generally exhibit better long-form factuality. Future research could explore improving language models' long-form factuality through better pretraining/finetuning or external tool integration.

However, while our work focuses on factuality in the sense of correctness with respect to world knowledge, it remains uncertain how to reliably measure hallucination in long-form settings. Through our benchmarking efforts, we aim to demonstrate how reliable dataset acquisition, model evaluation, and metric aggregation can significantly improve our understanding of language model capabilities in long-form settings. We hope that this work will inspire further research into both assessing and improving language models operating in long-form domains.

The code for LongFact and SAFE is accessible at https://github.com/google-deepmind/long-form-factuality for reference and replication purposes.
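The SAFE steps described in the summary (split into facts, make each self-contained, check relevance, rate against search results) can be sketched as a simple loop. This is only an illustrative outline, not the released implementation: the four callables are hypothetical stand-ins for the LLM and Google Search calls, and the label names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactRating:
    fact: str
    label: str  # assumed labels: "supported", "not_supported", "irrelevant"


def safe_sketch(
    prompt: str,
    response: str,
    split_facts: Callable[[str], List[str]],         # hypothetical: LLM splits response into facts
    make_self_contained: Callable[[str, str], str],  # hypothetical: LLM resolves pronouns etc.
    is_relevant: Callable[[str, str], bool],         # hypothetical: LLM relevance check
    is_supported: Callable[[str], bool],             # hypothetical: search + LLM reasoning
) -> List[FactRating]:
    """Rate each fact in a response, following the four steps named in the text."""
    ratings: List[FactRating] = []
    for raw_fact in split_facts(response):
        # Step 2: revise the fact so it can be understood without the response.
        fact = make_self_contained(raw_fact, response)
        # Step 3: facts that do not address the prompt are not fact-checked.
        if not is_relevant(fact, prompt):
            ratings.append(FactRating(fact, "irrelevant"))
            continue
        # Step 4: rate the fact using search results.
        label = "supported" if is_supported(fact) else "not_supported"
        ratings.append(FactRating(fact, label))
    return ratings


def count_labels(ratings: List[FactRating], label: str) -> int:
    """Count the facts that received a given label."""
    return sum(1 for r in ratings if r.label == label)
```

Passing the steps in as callables keeps the sketch runnable with toy stubs. The counts of supported and not-supported facts are the inputs that a metric such as F1@K would need.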
Created on 01 Apr. 2024
