Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

AI-generated keywords: Multi-hop Question Answering, 2WikiMultiHopQA, Structured and Unstructured Data, Wikidata, Evaluation Metrics

AI-generated Key Points

  • Introduction of a new dataset called 2WikiMultiHopQA for Multi-hop Question Answering
  • Dataset challenges reasoning and inference skills by requiring models to read multiple paragraphs
  • Comprehensive evidence information provided in the dataset enhances transparency and evaluation of model performance
  • Question-answer pairs are carefully designed to ensure multi-hop reasoning steps are necessary
  • Logical rules create natural yet challenging questions that demand multi-hop reasoning abilities
  • Structured data from Wikidata is used to keep questions realistic while still requiring multi-hop reasoning
  • Tasks include answer prediction, sentence-level supporting facts (SFs) prediction, and evidence generation, evaluated with exact match (EM) and F1 score
  • Joint metrics introduced to evaluate overall model capacity by combining precision and recall for answer spans, SFs, and evidence
  • Four types of questions included: comparison, inference, compositional, and bridge comparison
  • Post-processing techniques applied for data quality assurance including balancing yes/no questions and eliminating ambiguous cases
  • Distractor paragraphs collected using bigram tf-idf similarity measures to provide context for each question
  • Benchmark set up by using a single-hop model with five-fold cross-validation to create the train, dev, and test splits
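
The EM and F1 metrics named in the key points can be illustrated with a minimal sketch. It assumes the answer normalization commonly used for SQuAD-style evaluation (lowercasing, removing punctuation and English articles); this page does not confirm that the paper's official script uses exactly these steps, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the source.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only a fully correct answer string, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.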

Authors: Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa

Accepted by COLING 2020
License: CC BY 4.0

Abstract: A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question. In this study, we present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data. In our dataset, we introduce the evidence information containing a reasoning path for multi-hop questions. The evidence information has two benefits: (i) providing a comprehensive explanation for predictions and (ii) evaluating the reasoning skills of a model. We carefully design a pipeline and a set of templates when generating a question-answer pair that guarantees the multi-hop steps and the quality of the questions. We also exploit the structured format in Wikidata and use logical rules to create questions that are natural but still require multi-hop reasoning. Through experiments, we demonstrate that our dataset is challenging for multi-hop models and it ensures that multi-hop reasoning is required.

Submitted to arXiv on 02 Nov. 2020

Results of the summarizing process for the arXiv paper: 2011.01060v1

The researchers introduce a new dataset called 2WikiMultiHopQA for multi-hop question answering that combines unstructured text from Wikipedia with structured data from Wikidata. The dataset aims to challenge reasoning and inference skills by requiring models to read multiple paragraphs in order to answer complex questions. Unlike existing datasets, 2WikiMultiHopQA provides comprehensive evidence information that outlines the reasoning path from the question to the answer, which makes predictions easier to explain and lets the reasoning skills of a model be evaluated directly.

Question-answer pairs are generated with a pipeline and a set of templates designed so that multi-hop reasoning steps are necessary for answering each question. Logical rules are employed to create questions that are natural yet still demand multi-hop reasoning. By leveraging structured data from Wikidata, the researchers aim to ensure that questions are realistic and cannot be answered without multi-hop reasoning.

The tasks formulated in this study are answer prediction, sentence-level supporting facts (SFs) prediction, and evidence generation. Exact match (EM) and F1 score are used to assess model performance on each task. Joint metrics are introduced to evaluate the overall capacity of models by combining precision and recall for answer spans, SFs, and evidence.

The dataset includes four types of questions: comparison, inference, compositional, and bridge comparison. Inference and compositional questions are the two subtypes of the bridge question, in which a bridge entity connects two paragraphs; a bridge-comparison question combines a bridge question with a comparison question. To ensure data quality, post-processing was applied to balance yes/no questions and eliminate ambiguous cases. Additionally, distractor paragraphs were collected for each question using bigram tf-idf similarity.

To set up the benchmark, a single-hop model was used with five-fold cross-validation to create the train, dev, and test splits. Overall, this study contributes a resource for evaluating multi-hop QA models by providing a challenging dataset with detailed explanations for predictions and rigorous evaluation criteria. The full dataset along with baseline models can be accessed on GitHub for further research purposes.
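
The joint metrics described in the summary can be sketched as follows. This is an illustration under a stated assumption: per-task precision and recall are combined by multiplication, as in HotpotQA-style joint evaluation, and joint EM requires every task to be an exact match. The summary only says that precision and recall are combined, so the exact formula and the function names here are assumptions.

```python
from typing import Iterable, Tuple


def joint_f1(pr_pairs: Iterable[Tuple[float, float]]) -> float:
    """Combine (precision, recall) pairs for answer, SFs, and evidence.

    Assumption: joint precision and joint recall are the products of the
    per-task values, and joint F1 is their harmonic mean.
    """
    joint_p = 1.0
    joint_r = 1.0
    for precision, recall in pr_pairs:
        joint_p *= precision
        joint_r *= recall
    if joint_p + joint_r == 0:
        return 0.0
    return 2 * joint_p * joint_r / (joint_p + joint_r)


def joint_em(em_scores: Iterable[float]) -> float:
    """Return 1.0 only if every task scored an exact match."""
    return float(all(em == 1.0 for em in em_scores))


if __name__ == "__main__":
    # Hypothetical per-task (precision, recall): answer, SFs, evidence.
    example = [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
    print(joint_f1(example))
    print(joint_em([1.0, 1.0, 0.0]))
```

Under a product combination, a model that answers correctly but retrieves the wrong supporting facts or evidence is penalized on the joint score, which matches the stated goal of measuring overall capacity rather than answer accuracy alone.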
Created on 04 Nov. 2025
