This study focuses on teaching large multimodal models (LMMs) new skills without erasing their prior abilities. The researchers conducted sequential fine-tuning on five target skills and monitored general ability on eight held-out benchmarks across three model families. They observed that there was apparent "forgetting" on held-out tasks after narrow fine-tuning, but this loss in performance could partly recover at later stages. This behavior was linked to a measurable shift in the output token distribution, which was identified through a simple counting-bias probe that correlated with forgetting. Further analysis revealed that most of the shift in performance was driven by late MLP blocks rather than self-attention layers. Based on these findings, two simple and robust tuning recipes were proposed: updating only the self-attention projection layers or updating only the MLP Gate&Up layers while freezing the Down projection. These tuning strategies allowed for effective learning while limiting drift across different model families. Overall, this study provides insights into the learning and forgetting behavior of LMMs and offers practical recommendations to enhance model adaptation without sacrificing prior knowledge. The researchers hope that these findings will contribute to more stable and efficient continuous improvement of LMMs, ultimately reducing the environmental and financial costs associated with model adaptation. The study acknowledges limitations such as resource constraints and the need for further exploration with larger models and additional modalities like audio but sets a foundation for future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts. The work is supported by ONR award N00014-23-1-2383 and U.S. DARPA ECOLE Program No. #HR00112390060.
- Study focuses on teaching large multimodal models (LMMs) new skills without erasing prior abilities
- Sequential fine-tuning conducted on five target skills with monitoring of general ability on eight held-out benchmarks across three model families
- "Forgetting" observed on held-out tasks after narrow fine-tuning, but performance loss can partly recover at later stages
- Measurable shift in output token distribution identified through counting-bias probe correlated with forgetting
- Shift in performance driven by late MLP blocks rather than self-attention layers
- Two tuning recipes proposed: updating only self-attention projection layers or updating only MLP Gate&Up layers while freezing Down projection
- Strategies allow for effective learning while limiting drift across different model families
- Insights provided into learning and forgetting behavior of LMMs with practical recommendations to enhance model adaptation without sacrificing prior knowledge
- Acknowledgment of limitations such as resource constraints and need for further exploration with larger models and additional modalities like audio
- Foundation set for future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts
Summary:
Researchers studied how to teach big models new skills without making them forget what they already knew. They trained models on one new skill after another and found that the choice of which parts of the model to update matters. They noticed that performance on other tasks could drop at first but partly recover later on. This forgetting was linked to a shift in which output tokens the model tends to produce. By updating only specific parts of the model, they could learn new skills while limiting that shift.
Definitions:
- Large Multimodal Models (LMMs): Big computer programs that can understand and work with different types of information like text, images, and sounds.
- Fine-tuning: Adjusting a model's settings to make it better at a specific task.
- Measurable shift: A noticeable change that can be measured or tracked.
- Self-attention layers: Parts of a model that help it focus on important information within its input data.
- MLP blocks: Components in a model that process and transform data using mathematical operations.
Introduction:
Multimodal models have gained significant attention in recent years due to their ability to process and understand multiple forms of data such as text, images, and videos. These large multimodal models (LMMs) have shown impressive performance on various tasks, but they often require continuous fine-tuning to adapt to new skills or domains. However, this process can lead to the loss of previously learned abilities, also known as "forgetting." In this research paper, the authors aim to address this issue by proposing two simple tuning strategies that allow for effective learning while limiting forgetting.
Methodology:
The researchers conducted sequential fine-tuning on five target skills across three model families. They evaluated the general ability of these models on eight held-out benchmarks after each round of narrow fine-tuning. Additionally, they used a simple counting-bias probe to measure the shift in the output token distribution before and after fine-tuning.
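The summary does not spell out how the counting-bias probe is computed. One plausible reading, offered here only as a sketch, is to measure how often a model answers with a number on prompts that do not ask for one, and to compare that rate before and after fine-tuning. The function names, the regular expression, and the example responses below are all hypothetical.

```python
import re

# Hypothetical heuristic: a response "looks numeric" if it starts with
# digits or a small spelled-out number. This is an assumption, not the
# probe definition used in the study.
_NUMERIC = re.compile(
    r"^\s*(\d+|zero|one|two|three|four|five|six|seven|eight|nine|ten)\b",
    re.IGNORECASE,
)

def numeric_rate(responses):
    """Fraction of responses whose first token looks like a number."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if _NUMERIC.match(r))
    return hits / len(responses)

def counting_bias_shift(before, after):
    """Change in numeric-answer rate between two checkpoints."""
    return numeric_rate(after) - numeric_rate(before)

# Made-up responses to the same four prompts, for illustration only.
before = ["A dog on a beach.", "Two people talking.", "The sky is blue.", "A red car."]
after = ["3", "Two people talking.", "4 clouds", "A red car."]
shift = counting_bias_shift(before, after)
```

A positive `shift` would indicate that the fine-tuned checkpoint produces number-like answers more often than the base checkpoint on the same prompts, which is the kind of output-distribution drift the probe is described as tracking.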
Results:
The results showed apparent forgetting on held-out tasks after narrow fine-tuning, although this loss in performance could partly recover at later stages of the sequence. The forgetting was linked to a measurable shift in the output token distribution, which the counting-bias probe detected and which correlated with the drop in held-out performance. Further analysis attributed most of this shift to late MLP blocks rather than self-attention layers.
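To make "forgetting" and "partial recovery" concrete, the sketch below tracks the mean held-out score after each fine-tuning stage relative to the base model. The function names are hypothetical, and the benchmark names and scores are made-up placeholders, not results from the study.

```python
def held_out_drift(baseline, stage_scores):
    """Mean held-out score change relative to the base model, per stage."""
    base_mean = sum(baseline.values()) / len(baseline)
    drift = []
    for scores in stage_scores:
        stage_mean = sum(scores[name] for name in baseline) / len(baseline)
        drift.append(stage_mean - base_mean)
    return drift

def recovered_fraction(drift):
    """Share of the worst drop that has been regained by the final stage."""
    worst = min(drift)
    if worst >= 0:
        return 0.0  # no forgetting to recover from
    return (drift[-1] - worst) / (-worst)

# Placeholder numbers for two hypothetical benchmarks and two stages.
baseline = {"bench_a": 70.0, "bench_b": 60.0}
stages = [
    {"bench_a": 66.0, "bench_b": 54.0},  # after fine-tuning on skill 1
    {"bench_a": 68.0, "bench_b": 58.0},  # after fine-tuning on skill 2
]
drift = held_out_drift(baseline, stages)
recovery = recovered_fraction(drift)
```

With these placeholder values, the held-out mean drops after the first stage and regains part of that drop after the second, which is the pattern the results describe qualitatively.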
Tuning Strategies:
Based on these findings, two simple and robust tuning recipes were proposed: updating only the self-attention projection layers or updating only the MLP Gate&Up layers while freezing the Down projection. These strategies allowed for effective learning while limiting drift across different model families.
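The two recipes amount to choosing which parameters receive gradient updates. The sketch below selects parameters by name, assuming the `q_proj`/`k_proj`/`v_proj`/`o_proj` and `gate_proj`/`up_proj`/`down_proj` naming used by several open-source decoder implementations. That naming, the recipe labels, and the `prefix` argument are assumptions for illustration and may not match the models used in the study.

```python
# Assumed parameter-name fragments; adjust for the model at hand.
ATTN_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_GATE_UP = ("gate_proj", "up_proj")  # down_proj stays frozen

def trainable_names(param_names, recipe, prefix=""):
    """Return the parameter names that a given recipe would update."""
    if recipe == "attn":
        keys = ATTN_PROJ
    elif recipe == "mlp_gate_up":
        keys = MLP_GATE_UP
    else:
        raise ValueError(f"unknown recipe: {recipe!r}")
    return [
        name
        for name in param_names
        if name.startswith(prefix) and any(f".{k}." in name for k in keys)
    ]

# Hypothetical parameter names for one decoder layer.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.norm.weight",
]
attn_only = trainable_names(names, "attn")
gate_up_only = trainable_names(names, "mlp_gate_up")
```

In PyTorch, the selection could then be applied by looping over `model.named_parameters()` and calling `param.requires_grad_(name in selected)` with `selected` as a set of the returned names. Whether components outside the language model, such as a vision encoder, should be included is not stated in this summary; the `prefix` argument is one way to restrict the selection.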
Implications:
This study provides insights into the learning and forgetting behavior of LMMs and offers practical recommendations for adapting models without sacrificing prior knowledge. By understanding the mechanisms behind forgetting, researchers can develop more stable and efficient methods for the continual improvement of LMMs. The authors hope this will ultimately reduce the environmental and financial costs associated with model adaptation.
Limitations and Future Research:
The study acknowledges limitations such as resource constraints and the need for further exploration with larger models and additional modalities like audio. The authors also suggest future research in areas such as alternative architectures, longer sequences, privacy leakage, safety considerations, and societal impacts to gain a deeper understanding of LMMs' behavior.
Conclusion:
In conclusion, this research paper sheds light on the learning and forgetting behavior of large multimodal models. The proposed tuning strategies offer practical solutions to enhance model adaptation without sacrificing prior knowledge. This work has important implications for the continuous improvement of LMMs and sets a foundation for future research in this area. With further advancements in this field, we can expect more robust and efficient multimodal models that can adapt to new skills while retaining their previous abilities.