How to Teach Large Multimodal Models New Skills

AI-generated keywords: LMMs, sequential fine-tuning, forgetting behavior, model adaptation, practical recommendations

AI-generated Key Points

- Study focuses on teaching large multimodal models (LMMs) new skills without erasing prior abilities
- Sequential fine-tuning conducted on five target skills with monitoring of general ability on eight held-out benchmarks across three model families
- "Forgetting" observed on held-out tasks after narrow fine-tuning, but performance loss can partly recover at later stages
- Measurable shift in output token distribution identified through counting-bias probe correlated with forgetting
- Shift in performance driven by late MLP blocks rather than self-attention layers
- Two tuning recipes proposed: updating only self-attention projection layers or updating only MLP Gate&Up layers while freezing Down projection
- Strategies allow for effective learning while limiting drift across different model families
- Insights provided into learning and forgetting behavior of LMMs with practical recommendations to enhance model adaptation without sacrificing prior knowledge
- Acknowledgment of limitations such as resource constraints and need for further exploration with larger models and additional modalities like audio
- Foundation set for future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts

Authors: Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

arXiv: 2510.08564v1 (cs.AI)

In submission. Code is available at https://github.com/jessemelpolio/LMM_CL

License: CC BY 4.0

Abstract: How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
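As a rough illustration of the two recipes named in the abstract, the sketch below decides which parameters stay trainable by matching on parameter names. The substrings (`q_proj`, `gate_proj`, and so on) follow the naming used by LLaMA- and Qwen-style decoders in Hugging Face Transformers and are an assumption here, as is the choice to include all four attention projections. The recipe keys and the `scope` argument are hypothetical helpers, not names taken from the authors' code.

```python
from typing import Iterable, List, Tuple

# Hypothetical recipe table. The substrings assume Hugging Face style
# parameter names such as "model.layers.0.self_attn.q_proj.weight".
RECIPES = {
    # Recipe (i): update only the self-attention projections (assumed: Q, K, V, O).
    "sa_proj": ("self_attn.q_proj", "self_attn.k_proj",
                "self_attn.v_proj", "self_attn.o_proj"),
    # Recipe (ii): update only MLP Gate and Up; the Down projection stays frozen.
    "mlp_gate_up": ("mlp.gate_proj", "mlp.up_proj"),
}


def is_trainable(param_name: str, recipe: str, scope: str = "") -> bool:
    """Return True if the named parameter should receive gradient updates.

    `scope` optionally restricts matching to names containing that substring,
    for example to skip a vision encoder that reuses similar layer names.
    """
    if scope and scope not in param_name:
        return False
    return any(key in param_name for key in RECIPES[recipe])


def split_parameters(param_names: Iterable[str], recipe: str,
                     scope: str = "") -> Tuple[List[str], List[str]]:
    """Partition parameter names into (trainable, frozen) lists."""
    trainable: List[str] = []
    frozen: List[str] = []
    for name in param_names:
        (trainable if is_trainable(name, recipe, scope) else frozen).append(name)
    return trainable, frozen


if __name__ == "__main__":
    demo_names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.gate_proj.weight",
        "model.layers.0.mlp.up_proj.weight",
        "model.layers.0.mlp.down_proj.weight",
    ]
    for recipe in RECIPES:
        print(recipe, split_parameters(demo_names, recipe))
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and calling `param.requires_grad_(is_trainable(name, recipe))` on each entry, so that the optimizer only updates the selected layers.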

Submitted to arXiv on 09 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.08564v1

Comprehensive Summary

This study focuses on teaching large multimodal models (LMMs) new skills without erasing their prior abilities. The researchers conducted sequential fine-tuning on five target skills and monitored general ability on eight held-out benchmarks across three model families. They observed that there was apparent "forgetting" on held-out tasks after narrow fine-tuning, but this loss in performance could partly recover at later stages. This behavior was linked to a measurable shift in the output token distribution, which was identified through a simple counting-bias probe that correlated with forgetting. Further analysis revealed that most of the shift in performance was driven by late MLP blocks rather than self-attention layers.

Based on these findings, two simple and robust tuning recipes were proposed: updating only the self-attention projection layers or updating only the MLP Gate&Up layers while freezing the Down projection. These tuning strategies allowed for effective learning while limiting drift across different model families.

Overall, this study provides insights into the learning and forgetting behavior of LMMs and offers practical recommendations to enhance model adaptation without sacrificing prior knowledge. The researchers hope that these findings will contribute to more stable and efficient continuous improvement of LMMs, ultimately reducing the environmental and financial costs associated with model adaptation.

The study acknowledges limitations such as resource constraints and the need for further exploration with larger models and additional modalities like audio but sets a foundation for future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts. The work is supported by ONR award N00014-23-1-2383 and U.S. DARPA ECOLE Program No. #HR00112390060.
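The summary does not say how the counting-bias probe is computed. One plausible reading, sketched below purely as an assumption, is to compare how much next-token probability a model places on numeric tokens before and after fine-tuning on the same probe prompts. The function names and the dictionary input format are hypothetical and not taken from the paper or its code.

```python
from typing import Dict, List


def number_mass(token_probs: Dict[str, float]) -> float:
    """Total probability assigned to tokens made up only of digits.

    `token_probs` maps a decoded token string to its next-token probability
    (an assumed input format for this sketch).
    """
    return sum(p for tok, p in token_probs.items() if tok.strip().isdigit())


def counting_bias_shift(base_dists: List[Dict[str, float]],
                        tuned_dists: List[Dict[str, float]]) -> float:
    """Mean change in numeric-token mass from the base to the tuned model.

    Both lists hold one distribution per probe prompt, in the same order.
    A positive value means the tuned model leans more toward number tokens.
    """
    if len(base_dists) != len(tuned_dists) or not base_dists:
        raise ValueError("need the same non-zero number of prompts for both models")
    diffs = [number_mass(t) - number_mass(b)
             for b, t in zip(base_dists, tuned_dists)]
    return sum(diffs) / len(diffs)
```

Tracking a scalar like this after each fine-tuning stage would give a cheap signal to plot against held-out accuracy, which is the kind of co-variation the summary describes.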

Layman's Summary

Researchers studied how to teach big models new skills without making them forget what they already knew. They tried different ways of learning and found that some methods worked better than others. They noticed that the model's performance could drop at first but improve later on. Changes in how the model makes decisions were linked to this forgetting process. By focusing on specific parts of the model, they could control how well it learned new things.

Definitions

- Large Multimodal Models (LMMs): Big computer programs that can understand and work with different types of information like text, images, and sounds.
- Fine-tuning: Adjusting a model's settings to make it better at a specific task.
- Measurable shift: A noticeable change that can be measured or tracked.
- Self-attention layers: Parts of a model that help it focus on important information within its input data.
- MLP blocks: Components in a model that process and transform data using mathematical operations.

Blog article

Introduction: Multimodal models have gained significant attention in recent years due to their ability to process and understand multiple forms of data such as text, images, and videos. These large multimodal models (LMMs) perform well on a wide range of tasks, but they often need further fine-tuning to adapt to new skills or domains. That process can erase previously learned abilities, a problem known as "forgetting." The authors address this issue by proposing two simple tuning strategies that allow for effective learning while limiting forgetting.

Methodology: The researchers conducted sequential fine-tuning on five target skills across three model families. They evaluated the general ability of these models on eight held-out benchmarks after each round of narrow fine-tuning. They also used a counting-bias probe to measure the shift in the output token distribution before and after fine-tuning.

Results: Held-out performance dropped after narrow fine-tuning, but this apparent forgetting could partly recover at later stages. The behavior was linked to a measurable shift in the output token distribution, identified through the counting-bias probe. The researchers found that most of the shift was driven by late MLP blocks rather than self-attention layers.

Tuning Strategies: Based on these findings, two simple and robust tuning recipes were proposed: updating only the self-attention projection layers, or updating only the MLP Gate&Up layers while freezing the Down projection. These strategies allowed for effective learning while limiting drift across different model families.

Implications: This study provides insights into the learning and forgetting behavior of LMMs and offers practical recommendations for adapting a model without sacrificing prior knowledge. By understanding the mechanisms behind forgetting, researchers can develop more stable and efficient methods for continuously improving LMMs, which in turn should reduce the environmental and financial costs of model adaptation.

Limitations and Future Research: The study acknowledges limitations such as resource constraints and the need for further exploration with larger models and additional modalities like audio. The authors also suggest future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts.

Conclusion: This paper sheds light on the learning and forgetting behavior of large multimodal models. The proposed tuning strategies offer practical ways to adapt a model while preserving prior knowledge, and the work sets a foundation for future research on multimodal models that can learn new skills and retain their previous abilities.
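The evaluation protocol described above (score the held-out benchmarks after every fine-tuning stage and compare against the base model) can be sketched as simple bookkeeping. The function names and input layout below are hypothetical, the averaging over benchmarks is an assumption about how "general ability" is aggregated, and any numbers fed in are up to the reader; none come from the paper.

```python
from typing import Dict, List, Tuple


def average(scores: Dict[str, float]) -> float:
    """Unweighted mean over benchmarks (assumed aggregation)."""
    values = list(scores.values())
    return sum(values) / len(values)


def held_out_change(base_scores: Dict[str, float],
                    stage_scores: List[Dict[str, float]]) -> List[float]:
    """Held-out average after each stage minus the base model's average.

    `stage_scores[i]` holds benchmark scores measured after fine-tuning on
    the i-th target skill in the sequence. Negative entries indicate
    forgetting relative to the base model.
    """
    base_avg = average(base_scores)
    return [average(s) - base_avg for s in stage_scores]


def worst_drop_and_recovery(changes: List[float]) -> Tuple[float, float]:
    """Return (largest drop seen at any stage, amount regained by the end).

    Recovery is the final change minus the worst change, so it is zero when
    the last stage is also the worst one.
    """
    worst = min(changes)
    return worst, changes[-1] - worst
```

A positive recovery value after the final stage would correspond to the partial rebound in held-out performance that the summary reports.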

Created on 10 Oct. 2025
