How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

AI-generated keywords: Language Models, Calibration, Question Answering, Probabilistic Models, Confidence Scores

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors: Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig
  • Central question: How to determine when a language model knows, with confidence, the answer to a particular query
  • Approach: Examine calibration, i.e., how well a model's predicted probabilities correlate with the actual probabilities of correctness
  • Evaluation: Three generative models (T5, BART, GPT-2) are found to be poorly calibrated on question answering tasks
  • Methods explored for enhancing calibration:
      • Fine-tuning
      • Post-hoc probability adjustments
      • Modifications to predicted outputs or inputs
  • Effectiveness of calibration methods demonstrated through experiments on diverse datasets
  • Analysis of strengths and limitations of proposed approaches for calibrating language models
  • Availability of code at https://github.com/jzbjyb/lm-calibration
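
To make the notion of calibration in the key points concrete, the sketch below computes a binned expected calibration error (ECE), a standard way to measure the gap between stated confidence and observed accuracy. The metadata summarized here does not say which metric the authors report, so this is an illustration only; the function name, the equal-width binning, and the default of 10 bins are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Binned ECE: the size-weighted mean of |accuracy - mean confidence| per bin.

    `confidences` are the model's probabilities for its chosen answers and
    `correct` marks whether each chosen answer was right (hypothetical inputs).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * num_bins   # sum of confidences per bin
    hit_sum = [0.0] * num_bins    # number of correct answers per bin
    count = [0] * num_bins        # number of predictions per bin

    for p, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        conf_sum[idx] += p
        hit_sum[idx] += 1.0 if is_correct else 0.0
        count[idx] += 1

    ece = 0.0
    for i in range(num_bins):
        if count[i] == 0:
            continue
        accuracy = hit_sum[i] / count[i]
        mean_confidence = conf_sum[i] / count[i]
        ece += (count[i] / n) * abs(accuracy - mean_confidence)
    return ece


if __name__ == "__main__":
    # Four answers given with 90% confidence, only three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

A well-calibrated model yields a value near 0; a model that is confidently wrong on most questions yields a value near its average confidence.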

Authors: Zhengbao Jiang, Jun Araki, Haibo Ding, Graham Neubig

TACL 2021

Abstract: Recent works have shown that language models (LM) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.
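
As an illustration of what "post-hoc probability modification" can look like, the sketch below applies temperature scaling, a widely used post-hoc calibration technique. The abstract does not specify the exact method, so this should not be read as the authors' implementation. It assumes a multiple-choice setting in which the model assigns a score (logit) to each candidate answer; the function names, the grid of temperatures, and the toy development set are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (candidate logits, index of the gold answer)


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over candidate scores after dividing them by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(examples: Sequence[Example], temperature: float) -> float:
    """Average negative log-probability of the gold answers at a given temperature."""
    total = 0.0
    for logits, gold in examples:
        total -= math.log(softmax(logits, temperature)[gold])
    return total / len(examples)


def fit_temperature(
    examples: Sequence[Example],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Pick the temperature that minimizes NLL on held-out data (simple grid search)."""
    if grid is None:
        grid = [0.1 * k for k in range(1, 101)]  # assumed search range: 0.1 to 10.0
    return min(grid, key=lambda t: negative_log_likelihood(examples, t))


if __name__ == "__main__":
    # Toy development set: the model always strongly prefers candidate 0,
    # but candidate 0 is correct only three times out of four.
    dev = [([4.0, 0.0], 0)] * 3 + [([4.0, 0.0], 1)]
    t = fit_temperature(dev)
    print("temperature:", t)
    print("confidence before:", softmax([4.0, 0.0])[0])
    print("confidence after:", softmax([4.0, 0.0], t)[0])
```

A fitted temperature above 1 flattens the distribution, which lowers the confidence of an overconfident model without changing which candidate it ranks first, so accuracy is unaffected.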

Submitted to arXiv on 02 Dec. 2020

Results of the summarizing process for the arXiv paper: 2012.00955v2

In "How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering," Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig start from the observation that language models (LMs) capture various types of factual and commonsense knowledge yet still fail to answer many queries correctly. The central question is how to determine when a language model knows, with confidence, the answer to a specific query. The authors approach it through calibration: the property that a probabilistic model's predicted probabilities are well correlated with the actual probabilities of correctness. They evaluate three strong generative models (T5, BART, and GPT-2) and find that their probabilities on question answering tasks are poorly calibrated. To make confidence scores correlate better with the likelihood of correctness, they examine three families of methods: fine-tuning, post-hoc probability modification, and adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of these methods. A further analysis examines their strengths and limitations and points to possible improvements in methods for calibrating LMs. The code is available at https://github.com/jzbjyb/lm-calibration.
Created on 03 Mar. 2025
