Revisiting Transformer-based Models for Long Document Classification

AI-generated keywords: Long Document Classification, Transformer-based models, Sparse attention, Hierarchical Transformers, Pre-trained models

AI-generated Key Points

  • Bias towards short text sequences in recent literature on text classification
  • Challenges with encoding multi-page multi-paragraph documents efficiently using traditional Transformer-based models
  • Various approaches developed to address the issue, such as sparse attention and hierarchical encoding methods
  • Strategies for splitting documents into segments for efficient encoding by pre-trained models
  • Computational overhead of vanilla Transformers in long document classification due to O(n^2) time and memory complexity
  • Emergence of long-document Transformers designed to handle longer sequences effectively
  • Performance comparison of traditional BERT variants with CNN or RNN-based models on datasets like MIMIC-III with longer documents
  • Success of pre-train–fine-tune paradigm in long document classification
  • Significant contributions made by analyzing different components of Transformer-based long document classification models
  • Improvements in performance demonstrated when models can process more text, especially with a sparse attention model capable of processing up to 4096 tokens
  • Refutation of criticisms against Transformers for long document classification through experiments
  • Valuable insights provided for effective application of Transformer-based models in handling longer texts

Authors: Xiang Dai, Ilias Chalkidis, Sune Darkner, Desmond Elliott

Findings of EMNLP 2022
License: CC BY 4.0

Abstract: The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice of applying Transformer-based models on long document classification tasks.
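As a rough illustration of the sparse attention pattern the abstract refers to (a local attention window plus optional global attention), the sketch below builds a boolean attention mask in plain Python and counts how many token pairs it allows compared with full attention. The function names, the definition of `window` as the total window size, and the choice of position 0 as a global token are assumptions made for this example; they are not taken from the paper or from any specific library.

```python
def sparse_attention_mask(n_tokens, window, global_positions=()):
    """Return an n x n boolean matrix; mask[i][j] is True if token i may attend to token j.

    Assumptions (illustrative only): `window` is the total local window size,
    so each token sees window // 2 neighbours on each side, and every token in
    `global_positions` attends to, and is attended by, all other tokens.
    """
    half = window // 2
    # Local attention: a band around the diagonal.
    mask = [[abs(i - j) <= half for j in range(n_tokens)] for i in range(n_tokens)]
    # Global attention: full row and full column for each global token.
    for g in set(global_positions):
        for k in range(n_tokens):
            mask[g][k] = True
            mask[k][g] = True
    return mask


def count_allowed(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 64
    mask = sparse_attention_mask(n, window=8, global_positions=[0])
    print("full attention pairs:  ", n * n)
    print("sparse attention pairs:", count_allowed(mask))
```

With a fixed window, the number of allowed pairs grows roughly linearly with the sequence length rather than quadratically, which is the efficiency argument behind sparse attention.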

Submitted to arXiv on 14 Apr. 2022


Results of the summarizing process for the arXiv paper: 2204.06683v2

Recent literature on text classification is biased towards short text sequences such as sentences or paragraphs. In real-world applications, however, multi-page, multi-paragraph documents are common, and they cannot be encoded efficiently by vanilla Transformer-based models. Various approaches have been developed to address this issue, such as sparse attention and hierarchical encoding methods. Sparse attention involves adjusting aspects such as the size of the local attention window and the use of global attention to improve efficiency. Hierarchical methods, on the other hand, focus on strategies for splitting documents into segments that can be encoded by pre-trained models. These approaches have been tested on four document classification datasets across different domains.

One key challenge in long document classification is the computational overhead of vanilla Transformers: attending to all tokens in a sequence has O(n^2) time and memory complexity. This limitation has led to the emergence of long-document Transformers designed to handle longer sequences more effectively. Experiments conducted on datasets with longer documents, such as MIMIC-III with an average length of 2,000 words, have shown that traditional BERT variants perform worse than CNN or RNN-based models. This highlights the need to understand how Transformer-based models perform when classifying genuinely long documents.

The study aims to transfer the success of the pre-train and fine-tune paradigm to long document classification, and it contributes an analysis of the different components of Transformer-based long document classification models. The experiments on datasets like MIMIC-III demonstrate clear improvements in performance when models can process more text. A sparse attention model capable of processing up to 4096 tokens shows competitive results with CNN-based models on MIMIC-III. Hierarchical Transformers outperform CNN-based models significantly by splitting documents into small overlapping segments for efficient encoding by pre-trained models. Overall, these experiments refute criticisms against Transformers for long document classification and provide valuable insights for applying Transformer-based models effectively to longer texts. The study is funded by various organizations including Innovation Fund Denmark and CSIRO Precision Health Future Science Platform under projects like AI4Xray.
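The idea of splitting a document into small overlapping segments, mentioned in the summary above, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `split_into_segments` and the default values for `segment_len` and `overlap` are hypothetical choices for illustration, not the settings or code used by the authors. In a hierarchical model, each segment would then be encoded separately by a pre-trained encoder and the segment representations combined for classification; that step is not shown here.

```python
def split_into_segments(tokens, segment_len=128, overlap=32):
    """Split a token list into fixed-length segments that overlap by `overlap` tokens.

    `segment_len` and `overlap` are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    if not tokens:
        return []

    stride = segment_len - overlap  # how far the window advances each step
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + segment_len])
        # Stop once the current segment reaches the end of the document.
        if start + segment_len >= len(tokens):
            break
        start += stride
    return segments


if __name__ == "__main__":
    doc = list(range(10))  # stand-in for token ids
    for seg in split_into_segments(doc, segment_len=4, overlap=2):
        print(seg)
```

The overlap means tokens near a segment boundary appear in two segments, so context that straddles a boundary is not lost entirely; the final segment may be shorter than `segment_len`.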
Created on 24 Jul. 2024

