On the generalization of language models from in-context learning and finetuning: a controlled study

AI-generated keywords: Generalization, Language Models, In-Context Learning, Fine-Tuning, Inductive Biases

AI-generated Key Points

  • Study focused on generalization of language models from in-context learning and fine-tuning
  • Constructed novel datasets to evaluate and improve models' ability to generalize
  • Used a reversal dataset of fictional celebrity descriptions and a semantic-structure benchmark whose nouns, adjectives, and verbs were replaced with nonsense terms
  • In data-matched settings, in-context learning showed more flexible generalization than fine-tuning
  • Fine-tuning could also generalize effectively within a larger structure of knowledge
  • Proposed method involving adding in-context inferences to fine-tuning data to enhance generalization

Authors: Andrew K. Lampinen, Arslan Chaudhry, Stephanie C. Y. Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, James L. McClelland

License: CC BY 4.0

Abstract: Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning -- from failing to generalize to simple reversals of relations they are trained on, to missing logical deductions that can be made from trained information. These failures to generalize from fine-tuning can hinder practical application of these models. However, language models' in-context learning shows different inductive biases, and can generalize better in some of these cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' ability to generalize from finetuning data. The datasets are constructed to isolate the knowledge in the dataset from that in pretraining, to create clean tests of generalization. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

Submitted to arXiv on 01 May 2025


Results of the summarizing process for the arXiv paper: 2505.00661v1

In the study on the generalization of language models from in-context learning and fine-tuning, researchers explored the differences in generalization between these two learning methods. They constructed novel datasets to evaluate and improve models' ability to generalize from fine-tuning data by isolating knowledge in the dataset from that in pretraining. The datasets were designed to create clean tests of generalization, exposing pretrained large models to controlled subsets of information either in context or through fine-tuning.

One dataset used was the reversal dataset proposed by Berglund et al., containing descriptions of fictional celebrities with names either preceding or following the description. Another benchmark involved a semantic structure with a hierarchy of properties and relations based on real-world categories and relations. To make this structure novel to pretrained models, all nouns, adjectives, and verbs were replaced with nonsense terms. Despite potential tokenization challenges, short nonsense words were generated using plausible combinations of phonemes for English. For training, facts about the semantic hierarchy were assembled into synthetic articles resembling Wikipedia entries, along with QA examples to maintain question-answering capabilities during fine-tuning. The train set ensured that all necessary facts for test questions were presented at least once.

Overall, the study found that in data-matched settings, in-context learning exhibited more flexible generalization than fine-tuning. However, there were also cases where fine-tuning could generalize effectively within a larger structure of knowledge. To enhance generalization from fine-tuning, a method involving adding in-context inferences to fine-tuning data was proposed and shown to improve performance across various datasets and benchmarks.

These findings have implications for understanding the inductive biases of different learning modes in language models and offer practical insights for improving their performance.
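
The idea of adding in-context inferences to fine-tuning data can be pictured with a small sketch. This is not the authors' implementation: in the paper the inferences come from a language model that reads the training data in context, whereas here a toy rule that reverses "X is Y." statements stands in for that model. The function names (`toy_reversal_inferences`, `augment_finetuning_data`) and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List

# Hypothetical type: a function that reads training documents "in context"
# and returns extra inferred statements. In the paper this role is played by
# a language model; here it can be any Python callable.
InferenceFn = Callable[[List[str]], List[str]]


def toy_reversal_inferences(documents: List[str]) -> List[str]:
    """Stand-in for model-generated inferences: reverse 'X is Y.' statements."""
    inferred = []
    for doc in documents:
        sentence = doc.strip().rstrip(".")
        if " is " not in sentence:
            continue  # this toy rule only handles simple 'X is Y' sentences
        subject, description = sentence.split(" is ", 1)
        if not subject or not description:
            continue  # skip malformed statements
        reversed_head = description[0].upper() + description[1:]
        inferred.append(f"{reversed_head} is {subject}.")
    return inferred


def augment_finetuning_data(
    documents: Iterable[str], infer: InferenceFn
) -> List[str]:
    """Return the original documents followed by any new inferred statements."""
    docs = list(documents)
    seen = set(docs)
    augmented = list(docs)
    for statement in infer(docs):
        if statement not in seen:  # avoid adding duplicates
            seen.add(statement)
            augmented.append(statement)
    return augmented


if __name__ == "__main__":
    # Made-up fact about a fictional person, in the spirit of the reversal dataset.
    train_docs = ["Mira Voss is the composer of 'Glass Tides'."]
    for line in augment_finetuning_data(train_docs, toy_reversal_inferences):
        print(line)
```

Passing the inference step in as a callable keeps the augmentation loop independent of how the inferences are produced, so the toy rule could be swapped for a call to a model without changing the rest of the sketch.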
Created on 15 May 2025


