Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

AI-generated keywords: unsupervised paraphrasing, summarization, large language models (LLMs), zero-shot generations, Impossible Distillation

AI-generated Key Points

  • Traditional approaches in unsupervised paraphrasing and summarization rely on task-specific surrogates like back-translation and autoencoding.
  • These methods provide only weak supervision relative to the complexity of the target task, which necessitates engineered perturbations or complete re-training of teacher models.
  • Growing research explores using large language models (LLMs) for paraphrasing and summarization without supervision, with recent studies showing zero-shot generations achieving human-level quality.
  • Large-scale pre-training equips LLMs with knowledge to tackle complex tasks, but additional fine-tuning using instruction data or human feedback unlocks their full potential.
  • The paper introduces an alternative paradigm that leverages the intrinsic paraphrastic knowledge within LMs instead of relying on human annotation; related work has used LM-generated data to enhance model reasoning, robustness, controllability, and language understanding.
  • The framework offers a generalized approach to data generation that eliminates reliance on teacher model proficiency in the target task, showcasing superior performance even with smaller-scale LMs like GPT2.

Authors: Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi

NAACL 2024
License: CC BY 4.0

Abstract: We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes, even ChatGPT itself. Also, we find that our distilled dataset from 1.5B LMs exhibits higher diversity and fidelity than up to 13 times larger datasets.

Submitted to arXiv on 26 May 2023

Results of the summarizing process for the arXiv paper: 2305.16635v4

Traditional approaches in unsupervised paraphrasing and summarization often rely on task-specific surrogates like back-translation and autoencoding to guide models towards desired outputs. However, these methods provide weak supervision signals compared to the complexity of the target task. This necessitates engineered perturbations or complete re-training of teacher models.
On the other hand, a growing body of research is exploring the use of large language models (LLMs) for paraphrasing and summarization without supervision. Recent studies indicate that zero-shot generations from LLMs can achieve human-level quality in various scenarios. Furthermore, LLMs have been extensively studied for their capabilities in solving tasks across different domains. While large-scale pre-training equips models with sufficient knowledge to tackle complex tasks, recent findings suggest that their full potential is unlocked through additional fine-tuning using instruction data or human feedback. However, this typically requires curated sets of annotated data. In contrast to this approach, our work introduces an alternative paradigm by leveraging the intrinsic paraphrastic knowledge within LLMs instead of relying on human annotation.
Moreover, related works have explored data generation with LLMs to enhance model reasoning, robustness, controllability, and language understanding. These efforts typically involve distilling models with LM-generated data or extracting standalone corpora from LLMs for various applications. However, existing methods often impose strong assumptions on teacher LMs and require manually constructed prompts. In contrast, Impossible Distillation offers a generalized approach to data generation that eliminates reliance on the teacher model's proficiency in the target task.
Our framework performs paraphrasing and sentence summarization by distilling a high-quality dataset and model from a low-quality teacher that cannot itself perform these tasks. It exploits the paraphrastic proximity intrinsic to pre-trained LMs such as GPT2, where paraphrases occupy a proximal subspace of the LM distribution, and identifies and distills generations from these subspaces. Across multiple benchmarks spanning unconstrained and syntax-controlled paraphrase generation and sentence summarization, the resulting 770M-parameter model consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes even ChatGPT itself. The dataset distilled from 1.5B LMs also exhibits higher diversity and fidelity than datasets up to 13 times larger, which highlights the potential of leveraging LM knowledge without human annotation.
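The idea of identifying paraphrase pairs among generations that share a context can be illustrated with a small generate-then-filter sketch. This summary does not describe the paper's actual sampling or filtering procedure, so the code below is only an assumption-laden illustration: the names `token_overlap`, `filter_pairs`, `same_meaning`, and `max_overlap` are hypothetical, the candidate sentences are hand-written stand-ins for LM samples, and the meaning check is a toy lookup rather than a real model.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase whitespace-token sets of a and b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_pairs(
    candidates: Sequence[str],
    same_meaning: Callable[[str, str], bool],
    max_overlap: float = 0.6,
) -> List[Tuple[str, str]]:
    """Keep candidate pairs that differ on the surface but agree in meaning.

    `same_meaning` is a hypothetical judge supplied by the caller; it is
    checked in both directions so one-sided matches are rejected.
    """
    kept = []
    for a, b in combinations(candidates, 2):
        if token_overlap(a, b) > max_overlap:
            continue  # near-duplicate: not a useful paraphrase pair
        if same_meaning(a, b) and same_meaning(b, a):
            kept.append((a, b))
    return kept


# Hand-written stand-ins for sentences sampled from one shared context.
CANDIDATES = [
    "the firm posted record profits this quarter",
    "quarterly earnings at the company hit an all-time high",
    "the firm posted record profits this quarter again",
    "the firm moved its headquarters this quarter",
]

# Toy meaning labels (an assumption for the demo, not a real model).
TOY_LABELS = {CANDIDATES[0]: 0, CANDIDATES[1]: 0, CANDIDATES[2]: 0, CANDIDATES[3]: 1}


def toy_judge(a: str, b: str) -> bool:
    """Placeholder judge: two sentences match if they share a toy label."""
    return TOY_LABELS[a] == TOY_LABELS[b]


pairs = filter_pairs(CANDIDATES, toy_judge)
```

In this toy run the near-duplicate pair is dropped by the surface filter and the off-topic sentence is dropped by the meaning check, leaving only pairs that are worded differently but labeled as equivalent. A real pipeline would replace both the hand-written candidates and the lookup judge with model-based components.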
Created on 14 Aug. 2025
