How to Teach Large Multimodal Models New Skills

AI-generated keywords: LMMs, sequential fine-tuning, forgetting behavior, model adaptation, practical recommendations

AI-generated Key Points

  • Study focuses on teaching large multimodal models (LMMs) new skills without erasing prior abilities
  • Sequential fine-tuning conducted on five target skills with monitoring of general ability on eight held-out benchmarks across three model families
  • "Forgetting" observed on held-out tasks after narrow fine-tuning, but performance loss can partly recover at later stages
  • Forgetting traced to a measurable shift in the output token distribution, detected with a counting-bias probe that co-varies with forgetting
  • Shift in performance driven by late MLP blocks rather than self-attention layers
  • Two tuning recipes proposed: updating only self-attention projection layers or updating only MLP Gate&Up layers while freezing Down projection
  • Strategies allow for effective learning while limiting drift across different model families
  • Insights provided into learning and forgetting behavior of LMMs with practical recommendations to enhance model adaptation without sacrificing prior knowledge
  • Acknowledgment of limitations such as resource constraints and need for further exploration with larger models and additional modalities like audio
  • Foundation set for future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts
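
To make the monitoring setup in the key points concrete, here is a minimal sketch of how drift on held-out benchmarks could be tracked across sequential fine-tuning stages. The function name `held_out_drift`, the benchmark names, and all scores are hypothetical placeholders and are not taken from the paper; the paper's own metric may be defined differently.

```python
from statistics import mean


def held_out_drift(stage_scores, baseline):
    """Change in mean held-out score after each stage, relative to the base model.

    stage_scores: list of dicts (benchmark name -> score), one per fine-tuning stage.
    baseline: dict (benchmark name -> score) measured before any fine-tuning.
    Negative values indicate apparent forgetting.
    """
    base_avg = mean(baseline.values())
    return [mean(s[name] for name in baseline) - base_avg for s in stage_scores]


# Made-up scores, for illustration only.
baseline = {"bench_a": 70.0, "bench_b": 60.0}
stages = [
    {"bench_a": 64.0, "bench_b": 56.0},  # after fine-tuning on skill 1
    {"bench_a": 68.0, "bench_b": 60.0},  # after fine-tuning on skill 2
]
drift = held_out_drift(stages, baseline)
```

In this toy example the held-out average drops after the first stage and moves back toward the baseline after the second, which is the kind of partial recovery the key points describe.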

Authors: Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

In submission. Code is available at https://github.com/jessemelpolio/LMM_CL
License: CC BY 4.0

Abstract: How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
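
The abstract does not spell out how the counting-bias probe is computed. As a rough illustration only, one way to quantify a shift of the output token distribution toward numbers is to compare the probability mass that a base model and a tuned model assign to numeric tokens on the same prompt. The functions `number_token_mass` and `counting_bias_shift` and the toy distributions below are assumptions made for this sketch, not the authors' probe.

```python
def number_token_mass(next_token_probs):
    """Total probability on tokens that are purely digits (ignoring surrounding whitespace).

    next_token_probs: dict mapping token string -> probability for one prompt.
    """
    return sum(p for tok, p in next_token_probs.items() if tok.strip().isdigit())


def counting_bias_shift(base_probs, tuned_probs):
    """Increase in numeric-token mass from the base model to the tuned model."""
    return number_token_mass(tuned_probs) - number_token_mass(base_probs)


# Toy next-token distributions for the same non-counting prompt (made-up values).
base = {"a": 0.5, " the": 0.4, "3": 0.1}
tuned = {"a": 0.3, " the": 0.2, "3": 0.3, " 12": 0.2}
shift = counting_bias_shift(base, tuned)
```

A positive shift on prompts that do not ask for a number would suggest the tuned model has become biased toward numeric answers. In practice one would average this quantity over many prompts.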

Submitted to arXiv on 09 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.08564v1

This study examines how to teach large multimodal models (LMMs) new skills without erasing their prior abilities. The researchers fine-tuned models sequentially on five target skills and monitored general ability on eight held-out benchmarks across three model families. They observed apparent "forgetting" on held-out tasks after narrow fine-tuning, but found that this loss in performance can partly recover at later stages.

The behavior was linked to a measurable shift in the output token distribution, detected with a simple counting-bias probe that co-varies with forgetting. Further analysis indicated that most of the shift in performance was driven by late MLP blocks rather than self-attention layers. Based on these findings, the researchers propose two simple, robust tuning recipes: updating only the self-attention projection layers, or updating only the MLP Gate&Up layers while freezing the Down projection. Across model families, these recipes learn the target skills effectively while limiting drift.

Overall, the study offers insight into the learning and forgetting behavior of LMMs and gives practical recommendations for adapting models without sacrificing prior knowledge. The researchers hope the findings will contribute to more stable and efficient continual improvement of LMMs, ultimately reducing the environmental and financial costs of model adaptation. The study acknowledges limitations, including resource constraints and the need for further exploration with larger models and additional modalities such as audio, and it sets a foundation for future research on alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts.

The work is supported by ONR award N00014-23-1-2383 and U.S. DARPA ECOLE Program No. #HR00112390060.
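
The two tuning recipes described in the summary amount to choosing which parameters receive gradient updates. The sketch below selects parameter names for each recipe by substring matching. It assumes LLaMA-style module names (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`); those names, the recipe labels, the choice of which projections count as "self-attention projection layers", and the helper `trainable_names` are assumptions for illustration, and the released code may organize this differently.

```python
# Hypothetical recipe labels mapped to assumed LLaMA-style sub-module names.
RECIPES = {
    "sa_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),  # self-attention projections only
    "mlp_gate_up": ("gate_proj", "up_proj"),              # MLP Gate and Up; Down stays frozen
}


def trainable_names(param_names, recipe):
    """Return the parameter names that a recipe would leave trainable."""
    keys = RECIPES[recipe]
    return [n for n in param_names if any(f".{k}." in n for k in keys)]


# Parameter names as they might appear for one decoder layer (illustrative).
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
]
gate_up = trainable_names(names, "mlp_gate_up")
attn = trainable_names(names, "sa_proj")
```

With a PyTorch model, the selected names could then be applied by iterating over `model.named_parameters()` and calling `param.requires_grad_(name in selected)` on each parameter, so that everything outside the recipe stays frozen. If the vision encoder uses the same sub-module names, the match would also need to be restricted to the language model's layers.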
Created on 10 Oct. 2025
