Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

AI-generated keywords: Natural Language Processing, Limited Labelled Data, General Large Language Models, Specialized Small Models, Performance Variance

AI-generated Key Points

  • Researchers explore NLP tasks with limited labelled data
  • Compare general large language models vs. fine-tuning smaller specialized models
  • Aim to identify how many labelled samples are needed for specialized models to surpass general models
  • Investigate techniques like fine-tuning, instruction-tuning, prompting, and in-context learning across seven language models
  • Find that specialized models often need only a few samples (on average 10-1000) to match or outperform general models
  • Number of required labels depends strongly on dataset and task characteristics: it is significantly lower on multi-class datasets (up to 100) than on binary datasets (up to 5000)
  • Taking performance variance into account increases the number of required labels by 100-200% on average, and by up to 1500% in specific cases
  • Study compares data-efficient approaches using large language models and addresses ethical considerations regarding dataset usage and biases
  • Impact statement includes CO2 emissions from compute resources used during experiments and efforts to reduce resource consumption

Authors: Branislav Pecher, Ivan Srba, Maria Bielikova

License: CC BY 4.0

Abstract: When solving NLP tasks with limited labelled data, researchers can either use a general large language model without further update, or use a small number of labelled examples to tune a specialised smaller model. In this work, we address the research gap of how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 7 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only few samples (on average $10 - 1000$) to be on par or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with this number being significantly lower on multi-class datasets (up to $100$) than on binary datasets (up to $5000$). When performance variance is taken into consideration, the number of required labels increases on average by $100 - 200\%$ and even up to $1500\%$ in specific cases.

Submitted to arXiv on 20 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.12819v2

This study examines NLP tasks with limited labelled data. It compares two options: using a general large language model without further updates, or tuning a smaller specialized model on a small number of labelled examples. The goal is to determine how many labelled samples the specialized small models need in order to outperform general large models, while taking performance variance into account.

The researchers observe the behaviour of fine-tuning, instruction-tuning, prompting, and in-context learning on seven language models, and identify performance break-even points across eight representative text classification tasks of varying characteristics. They find that specialized models often need only a few samples (on average 10-1000) to match or outperform general models. The number of required labels depends strongly on dataset and task characteristics: it is significantly lower on multi-class datasets (up to 100) than on binary datasets (up to 5000). When performance variance is taken into account, the number of required labels increases by 100-200% on average, and by up to 1500% in specific cases.

The study also reviews related work comparing data-efficient approaches that use large language models. These comparisons often focus on specific settings, such as particular model sizes, approaches, or numbers of labelled samples. Ethical considerations regarding dataset usage and potential biases in large language models are addressed as well.

An impact statement notes that the experiments consumed significant compute resources because of the repeated training and evaluation runs across models. It reports the estimated total CO2 emissions and describes efforts to reduce resource consumption where possible. Overall, the study shows that specialized small models can be effective with limited labelled data and identifies factors that influence how they compare with general large language models.
Created on 22 Jan. 2025
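
To make the notion of a break-even point concrete, the following is a minimal Python sketch, not the authors' code. It assumes the specialized model has been evaluated over repeated runs at several label counts and the general model over repeated runs with no tuning. The function names, the toy scores, and in particular the variance-aware criterion (specialized mean minus one standard deviation must reach the general mean plus one standard deviation) are illustrative assumptions; this page does not state the exact criterion used in the paper.

```python
from statistics import mean, stdev


def break_even(specialised_runs, general_runs, account_for_variance=False):
    """Return the smallest label count at which the specialised model
    matches or beats the general model, or None if it never does.

    specialised_runs: dict mapping label count -> list of scores (repeated runs)
    general_runs: list of scores of the general model (repeated runs)
    """
    general_mean = mean(general_runs)
    general_std = stdev(general_runs) if len(general_runs) > 1 else 0.0

    for n_labels in sorted(specialised_runs):
        scores = specialised_runs[n_labels]
        spec_mean = mean(scores)
        spec_std = stdev(scores) if len(scores) > 1 else 0.0

        if account_for_variance:
            # ASSUMED criterion: a pessimistic specialised score must reach
            # an optimistic general score (one standard deviation each way).
            reached = spec_mean - spec_std >= general_mean + general_std
        else:
            # Plain comparison of mean scores.
            reached = spec_mean >= general_mean

        if reached:
            return n_labels
    return None


def percent_increase(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical toy scores, invented purely for illustration.
GENERAL_SCORES = [0.80, 0.82, 0.78]
SPECIALISED_SCORES = {
    10: [0.60, 0.70, 0.50],
    100: [0.81, 0.85, 0.77],
    300: [0.88, 0.90, 0.86],
}

plain = break_even(SPECIALISED_SCORES, GENERAL_SCORES)
robust = break_even(SPECIALISED_SCORES, GENERAL_SCORES, account_for_variance=True)
print(plain, robust, percent_increase(plain, robust))
```

On these toy numbers the variance-aware break-even point is later than the plain one, which mirrors the direction of the reported effect. As a reading aid for the percentages: an increase of 100% doubles the number of required labels, 200% triples it, and 1500% corresponds to 16 times as many labels.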

