Improved Text Classification via Test-Time Augmentation

AI-generated keywords: Test-Time Augmentation, Text Classification, NLP, WILDS CivilComments, Augmentation Policy

AI-generated Key Points

  • Test-time augmentation (TTA) is a technique used in image classification to improve model performance without additional training.
  • TTA has seen limited adoption in natural language processing (NLP) due to the difficulty of identifying label-preserving transformations.
  • The authors present augmentation policies that yield significant accuracy improvements with language models using TTA.
  • Augmentation policy design, such as the number of samples generated from a single non-deterministic augmentation, has a considerable impact on the benefit of TTA.
  • The authors apply an augmentation policy containing M transforms to generate M transformed inputs from a text input t.
  • A single prediction is obtained by averaging the M + 1 logit vectors produced for the original input and its M transformed versions.
  • The method is evaluated on the WILDS CivilComments dataset, which consists of 448,000 comments collected from the Civil Comments platform and labeled for toxicity and identity-based hate speech detection.
  • Experiments on a binary classification task and dataset show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches.
  • Certain combinations of augmentations yield better results than others.
  • This study demonstrates how test-time augmentation can be applied effectively to improve text classification models' performance without additional training.
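
The policy-design point in the key points above (how many samples to draw from each non-deterministic augmentation) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `random_swap`, `random_deletion`, and `apply_policy` are hypothetical helpers that stand in for the paper's Swap and Deletion augmentations, and representing a policy as a list of (transform, number of samples) pairs is an assumption made here for illustration.

```python
import random


def random_swap(text, rng):
    """Swap two randomly chosen words (hypothetical stand-in for 'Swap')."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def random_deletion(text, rng):
    """Delete one randomly chosen word (hypothetical stand-in for 'Deletion')."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def apply_policy(text, policy, seed=0):
    """Return the M transformed inputs generated from `text`.

    `policy` is a list of (transform, n_samples) pairs, so M equals the
    sum of n_samples over all transforms in the policy.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    augmented = []
    for transform, n_samples in policy:
        for _ in range(n_samples):
            augmented.append(transform(text, rng))
    return augmented


# Assumed example policy: two samples from each augmentation, so M = 4.
policy = [(random_swap, 2), (random_deletion, 2)]
variants = apply_policy("this comment is perfectly civil", policy)
```

Changing the sample counts in `policy` changes M without touching the classifier, which is the kind of design knob the summary says has a considerable impact on the benefit of TTA.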

Authors: Helen Lu, Divya Shanmugam, Harini Suresh, John Guttag

License: CC BY 4.0

Abstract: Test-time augmentation -- the aggregation of predictions across transformed examples of test inputs -- is an established technique to improve the performance of image classification models. Importantly, TTA can be used to improve model performance post-hoc, without additional training. Although test-time augmentation (TTA) can be applied to any data modality, it has seen limited adoption in NLP due in part to the difficulty of identifying label-preserving transformations. In this paper, we present augmentation policies that yield significant accuracy improvements with language models. A key finding is that augmentation policy design -- for instance, the number of samples generated from a single, non-deterministic augmentation -- has a considerable impact on the benefit of TTA. Experiments across a binary classification task and dataset show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches.

Submitted to arXiv on 27 Jun. 2022

Results of the summarizing process for the arXiv paper: 2206.13607v1

Test-time augmentation (TTA) is a well-established technique in image classification that aggregates predictions across transformed versions of each test input to improve model performance without additional training. However, TTA has seen limited adoption in natural language processing (NLP), in part because label-preserving transformations are difficult to identify. In the paper "Improved Text Classification via Test-Time Augmentation," the authors present augmentation policies that yield significant accuracy improvements when TTA is applied to language models. The study shows that augmentation policy design, such as the number of samples generated from a single non-deterministic augmentation, has a considerable impact on the benefit of TTA.

The authors apply an augmentation policy containing M transforms to generate M transformed inputs from a text input t. All M + 1 inputs (the original plus the M transformed ones) are then passed into a pre-trained classifier f, producing M + 1 vectors in R^C, where C is the number of classes and each vector holds the class logit predictions. A single prediction is obtained by taking a simple average of the M + 1 logit vectors. The authors choose averaging because it is the simplest version of TTA and suits their goal of understanding the baseline value of TTA in NLP.

The method is evaluated on a single dataset and model architecture. The dataset is WILDS CivilComments, which consists of 448,000 comments collected from the Civil Comments platform and labeled for toxicity and identity-based hate speech detection. Experiments on this binary classification task show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches. The authors also find that certain combinations of augmentations, such as Insertion+Swap+Synonym (PPDB)+Synonym (WordNet) or Deletion+Insertion+Swap+Synonym (PPDB), yield better results than others.
Overall, this study demonstrates how test-time augmentation can be applied effectively to improve text classification models' performance without additional training. The authors' augmentation policies and findings provide valuable insights for future research in this area.
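
The aggregation step described in the summary (pass the original input and its M transformed versions through a classifier f, then average the M + 1 logit vectors) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: `tta_predict` is a hypothetical helper, and `toy_classifier` and the two transforms are invented stand-ins for a pre-trained language model and a real augmentation policy.

```python
def tta_predict(text, classifier, transforms):
    """Average class logits over the original input and its transformed copies.

    `classifier` maps a string to a list of C logits; `transforms` is a
    list of M functions that each map a string to a transformed string.
    Returns (predicted_class_index, mean_logits).
    """
    inputs = [text] + [transform(text) for transform in transforms]  # M + 1 inputs
    logits = [classifier(x) for x in inputs]                         # M + 1 vectors of length C
    num_classes = len(logits[0])
    mean_logits = [
        sum(vector[c] for vector in logits) / len(logits)
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=lambda c: mean_logits[c])
    return predicted, mean_logits


def toy_classifier(text):
    """Invented two-class scorer: index 0 = non-toxic, index 1 = toxic."""
    hits = sum(word == "awful" for word in text.lower().split())
    return [1.0 - hits, float(hits)]


# Assumed deterministic transforms, used only to make the example reproducible.
transforms = [
    lambda t: t.replace("awful", "terrible"),   # crude synonym substitution
    lambda t: " ".join(reversed(t.split())),    # crude word reordering
]

predicted, mean_logits = tta_predict("what an awful take", toy_classifier, transforms)
```

In practice `classifier` would wrap a fine-tuned model's forward pass; nothing else in `tta_predict` changes, which is why TTA can be applied after training with no updates to the model.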
Created on 25 Apr. 2023
