UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

AI-generated keywords: Artificial Intelligence, Commonsense Reasoning, Multitask Benchmark, Transfer Learning, UNICORN

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Achieving commonsense reasoning in artificial intelligence has been a challenging task.
  • Recent advancements in research have renewed interest in this area, driven by new benchmarks and models.
  • The RAINBOW multitask benchmark evaluates the ability of commonsense models to generalize across tasks and datasets.
  • The cost equivalent curve, a new evaluation method, shows how the choice of source datasets, pretrained language models, and transfer learning methods affects performance and data efficiency.
  • Transfer learning almost always leads to better or equivalent performance when a particular recipe is followed.
  • Question-answering-based commonsense datasets transfer well to one another, while commonsense knowledge graphs do not.
  • Larger models benefit more from transfer learning than smaller ones do.
  • UNICORN is a universal commonsense reasoning model that sets a new state of the art on eight popular commonsense benchmarks.

Authors: Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For associated code and data see https://github.com/allenai/rainbow

Abstract: Commonsense AI has long been seen as a near impossible goal -- until recently. Now, research interest has sharply increased with an influx of new benchmarks and models. We propose two new ways to evaluate commonsense models, emphasizing their generality on new tasks and building on diverse, recently introduced benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote research on commonsense models that generalize well over multiple tasks and datasets. Second, we propose a novel evaluation, the cost equivalent curve, that sheds new insight on how the choice of source datasets, pretrained language models, and transfer learning methods impacts performance and data efficiency. We perform extensive experiments -- over 200 experiments encompassing 4800 models -- and report multiple valuable and sometimes surprising findings, e.g., that transfer almost always leads to better or equivalent performance if following a particular recipe, that QA-based commonsense datasets transfer well with each other, while commonsense knowledge graphs do not, and that perhaps counter-intuitively, larger models benefit more from transfer than smaller ones. Last but not least, we introduce a new universal commonsense reasoning model, UNICORN, that establishes new state-of-the-art performance across 8 popular commonsense benchmarks, aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA (90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA (79.3%).

Submitted to arXiv on 24 Mar. 2021

Results of the summarizing process for the arXiv paper: 2103.13009v1

Commonsense reasoning has long been considered one of the hardest problems in artificial intelligence. Interest in the area has recently grown sharply, driven by new benchmarks and models. In "UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark," Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi propose two new ways to evaluate commonsense models.

The first is RAINBOW, a multitask benchmark designed to assess how well commonsense models generalize across tasks and datasets. The second is the cost equivalent curve, an evaluation method that shows how the choice of source datasets, pretrained language models, and transfer learning methods affects performance and data efficiency.

Across more than 200 experiments encompassing 4800 models, the authors report several findings. Transfer learning almost always leads to better or equivalent performance when a particular recipe is followed. Question-answering-based commonsense datasets transfer well to one another, whereas commonsense knowledge graphs do not. Perhaps counterintuitively, larger models benefit more from transfer than smaller ones.

Finally, the study introduces UNICORN, a universal commonsense reasoning model that sets a new state of the art on eight popular commonsense benchmarks: aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA (90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%), and CommonsenseQA (79.3%). Together, the benchmark, the evaluation method, and the model offer a common basis for evaluating and developing commonsense reasoning models that generalize across tasks.
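The summary does not spell out how a cost equivalent curve is computed. As a minimal sketch, assuming it asks "how many training examples does a new method need to match the accuracy a baseline reaches at each training size?", the idea could be illustrated as below. The function names, the running-max smoothing, the linear interpolation, and the toy numbers are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from bisect import bisect_left
from itertools import accumulate


def examples_to_match(target_acc, sizes, accs):
    """Interpolated training size at which a non-decreasing learning
    curve (sizes, accs) first reaches target_acc.

    Returns None when the curve never reaches the target in the
    observed range (hypothetical convention for this sketch).
    """
    if target_acc <= accs[0]:
        return float(sizes[0])  # already matched at the smallest size
    if target_acc > accs[-1]:
        return None
    i = bisect_left(accs, target_acc)  # first index with accs[i] >= target
    frac = (target_acc - accs[i - 1]) / (accs[i] - accs[i - 1])
    return sizes[i - 1] + frac * (sizes[i] - sizes[i - 1])


def cost_equivalent_curve(sizes, baseline_accs, new_accs):
    """Pair each baseline training size with the number of examples the
    new method needs to reach the same accuracy."""
    # A running max makes both curves non-decreasing; this is a crude
    # stand-in for a proper monotone fit and is an assumption here.
    base = list(accumulate(baseline_accs, max))
    new = list(accumulate(new_accs, max))
    return [(n, examples_to_match(a, sizes, new)) for n, a in zip(sizes, base)]


# Toy, made-up learning curves: accuracy at each training-set size.
sizes = [1000, 2000, 4000, 8000]
baseline = [0.60, 0.66, 0.72, 0.78]   # e.g. single-task training
transfer = [0.66, 0.72, 0.78, 0.82]   # e.g. with multitask transfer

for base_n, new_n in cost_equivalent_curve(sizes, baseline, transfer):
    print(f"baseline uses {base_n} examples; new method needs about {new_n}")
```

Points below the diagonal (the new method needs fewer examples than the baseline) would indicate that transfer improves data efficiency under this reading.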
Created on 19 Mar. 2025
