LESS: Selecting Influential Data for Targeted Instruction Tuning

AI-generated keywords: LESS

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors discuss challenges in developing specialized capabilities in large language models (LLMs) for real-world applications
  • LESS algorithm offers a practical and efficient solution for targeted instruction tuning in LLMs
  • Algorithm constructs a gradient datastore with low-dimensional features for effective reuse and transferability
  • LESS demonstrates superior performance by selecting examples based on similarity to few-shot instances representing specific capabilities
  • Selected data exhibits high transferability, enabling smaller models to identify useful data for larger models and different model families

Authors: Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen

Code and data are available at https://github.com/princeton-nlp/LESS

Abstract: Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
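The two stages named in the abstract (build a datastore of low-dimensional gradient features, then select by similarity to a few target examples) can be illustrated with a toy sketch. This is not the authors' implementation: the random ±1 projection, the cosine similarity, the max-over-targets scoring rule, and all function names are assumptions made for illustration, the "gradients" are random vectors, not model gradients, and the Adam-aware influence formulation and variable-length handling described in the abstract are not modeled. For the actual method, see the repository linked above.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """Fixed random +/-1 projection matrix (assumed choice of projection)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.choice((-scale, scale)) for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vec):
    """Map a full gradient vector to a low-dimensional feature."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_top_fraction(train_feats, target_feats, fraction=0.05):
    """Score each training example by its best similarity to any target
    example (assumed aggregation) and return the top `fraction` of indices,
    highest score first."""
    scores = [max(cosine(f, t) for t in target_feats) for f in train_feats]
    k = max(1, int(len(train_feats) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: 40 random "training gradients" in 50 dimensions.
rng = random.Random(1)
target_dir = [rng.gauss(0, 1) for _ in range(50)]
train_grads = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(40)]
# Plant two training examples whose gradients point along the target direction.
train_grads[3] = [2.0 * x for x in target_dir]
train_grads[17] = [0.5 * x for x in target_dir]

# Stage 1: build the low-dimensional datastore once; it can be reused per target.
P = random_projection(50, 8)
datastore = [project(P, g) for g in train_grads]

# Stage 2: project the few-shot target gradient(s) and pick the top 5%.
target_feats = [project(P, target_dir)]
selected = select_top_fraction(datastore, target_feats, fraction=0.05)
print(selected)  # the two planted examples should rank at the top
```

Because the datastore is computed once and only the handful of target gradients changes per task, the same projected features can serve many selection runs, which is the reuse property the abstract emphasizes.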

Submitted to arXiv on 06 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.04333v1

In their paper "LESS: Selecting Influential Data for Targeted Instruction Tuning," Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen discuss the challenges of developing specialized capabilities in large language models (LLMs) for real-world applications. The algorithm constructs a gradient datastore of low-dimensional gradient features that can be reused and transferred. It then selects examples based on their similarity to few-shot examples that embody a specific capability; training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. The selected data is also highly transferable: smaller models can be used to select useful data for larger models and for models from different families.

Influence of Data Selection on LLMs

While instruction tuning has enabled the development of general-purpose chatbots by leveraging combined datasets, tasks requiring specific skills such as reasoning call for a targeted approach to data selection. The authors introduce LESS, an algorithm designed to estimate data influences and perform Low-rank gradiEnt Similarity Search for selecting instruction data efficiently. LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data.

Efficient Selection Process with LESS

Through qualitative analysis, the authors show that LESS goes beyond surface-form cues to identify data that exemplifies the reasoning skills required for the intended downstream application.
Overall, LESS offers a practical and efficient solution for targeted instruction tuning in LLMs, showcasing its potential to enhance model performance and applicability in diverse real-world scenarios.
Created on 19 Mar. 2025
