Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches
AI-generated Key Points
- Text classification of unseen classes is a significant challenge in Natural Language Processing.
- Two main types of methods used for this task are similarity-based approaches and zero-shot text classification approaches.
- Existing literature lacks consistent comparisons between these two types of methods.
- The study systematically evaluated different similarity-based and zero-shot text classification methods for unseen classes.
- Various state-of-the-art approaches were benchmarked on four text classification datasets, including a new dataset from the medical domain.
- Novel baselines using SimCSE and SBERT embeddings were proposed because the baselines used in existing work yield weak classification results and are easily outperformed.
- A new similarity-based approach called Lbl2TransformerVec was introduced and outperformed existing state-of-the-art methods in unsupervised text classification tasks.
- Similarity-based approaches significantly outperformed zero-shot methods in most cases in the experiments.
- Advanced embeddings like SimCSE or SBERT led to further improvements in similarity-based classification results compared to simpler text representations.
Authors: Tim Schopf, Daniel Braun, Florian Matthes
Abstract: Text classification of unseen classes is a challenging Natural Language Processing task and is mainly attempted using two different types of approaches. Similarity-based approaches attempt to classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have already investigated individual approaches to these categories, the experiments in literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations increases similarity-based classification results even further.
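The similarity-based idea described in the abstract can be illustrated with a minimal sketch: embed each document and each class description, then assign the class whose description is most similar to the document. This is not the authors' Lbl2TransformerVec implementation. The `embed` function below is a toy bag-of-words stand-in for an SBERT or SimCSE encoder, and the function names, class descriptions, and example document are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): a bag-of-words count vector.

    The paper uses SimCSE or SBERT embeddings instead; this keeps the sketch
    dependency-free.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(document, class_descriptions):
    """Return the label whose description is most similar to the document, plus all scores."""
    doc_vec = embed(document)
    scores = {
        label: cosine(doc_vec, embed(description))
        for label, description in class_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical class descriptions; no labeled training documents are needed.
class_descriptions = {
    "sports": "sports game team player match",
    "medicine": "medicine patient disease treatment doctor",
}

label, scores = classify(
    "the doctor discussed the treatment with the patient",
    class_descriptions,
)
print(label, scores)
```

Swapping `embed` for a stronger sentence encoder would leave the rest of the sketch unchanged, which is consistent with the abstract's point that better text representations improve similarity-based classification.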