PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

AI-generated keywords: PromptKD

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Zheng Li, Xiang Li, Xinyi Fu, Xing Zhang, Weiqiang Wang, and Jian Yang introduce PromptKD for enhancing vision-language models (VLMs)
  • PromptKD uses prompts as distillers to transfer knowledge from large teacher models to smaller target models
  • Framework consists of two stages: first, a large CLIP teacher model is pre-trained with domain (few-shot) labels; then the teacher and student logits are aligned via KL divergence through learnable prompts
  • The distillation stage eliminates the reliance on labeled data by leveraging unlabeled images within the domain
  • Presented by the authors as the first work to perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP
  • Also presented as the first to establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student models
  • Demonstrated effectiveness through experiments on 11 datasets across various domains

Authors: Zheng Li, Xiang Li, Xinyi Fu, Xing Zhang, Weiqiang Wang, Jian Yang

CVPR 2024. Project Page: https://zhengli97.github.io/PromptKD/. Code: https://github.com/zhengli97/PromptKD

Abstract: Prompt learning has emerged as a valuable technique in enhancing vision-language models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly focuses on designing various learning forms of prompts, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically, our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we leverage the unique decoupled-modality characteristics of CLIP by pre-computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits. Further, we align the logits of both the teacher and student models via KL divergence, encouraging the student image encoder to generate similar probability distributions to the teacher through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally, the well-trained student image encoders and pre-stored text features (class vectors) are utilized for inference. To our best knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical pre-storing mechanism of text features as shared class vectors between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
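
The pre-storing mechanism described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the feature values, the `class_logits` helper, and the scale of 100 are assumptions chosen for illustration, and the student features are assumed to already have the same dimensionality as the class vectors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_logits(image_feature, class_vectors, scale=100.0):
    """Scaled cosine similarity between one image feature and each class vector.

    The scale of 100 is an assumed value, not one taken from the paper.
    """
    image = l2_normalize(image_feature)
    return [scale * sum(a * b for a, b in zip(image, c)) for c in class_vectors]


# Hypothetical output of the teacher text encoder, computed and stored once.
class_vectors = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]

# Hypothetical image features for the same image from each image encoder.
teacher_feature = [0.9, 0.1, 0.0]
student_feature = [0.6, 0.4, 0.0]

# Both encoders reuse the same stored class vectors to produce logits.
teacher_logits = class_logits(teacher_feature, class_vectors)
student_logits = class_logits(student_feature, class_vectors)
```

Because `class_vectors` is computed only once, this sketch never calls a text encoder again: each new image needs only an image feature and a lookup of the stored vectors.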

Submitted to arXiv on 05 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.02781v1

In "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models," authors Zheng Li, Xiang Li, Xinyi Fu, Xing Zhang, Weiqiang Wang, and Jian Yang present an approach to adapting vision-language models (VLMs) such as CLIP to specific domains. Rather than treating prompts only as a learning device, the work uses them as distillers that transfer knowledge from a larger teacher model to a lightweight target model.

The proposed framework consists of two stages. In the first stage, a large CLIP teacher model is pre-trained using domain (few-shot) labels. Because CLIP's image and text modalities are decoupled, the text features are then computed once through the teacher text encoder and stored as class vectors. In the second stage, these stored class vectors are shared between the teacher and student image encoders to calculate predicted logits. The logits of the two models are aligned via KL divergence, which encourages the student image encoder, through the learnable prompts, to generate probability distributions similar to those of the teacher. Because this distillation step requires no labels, the algorithm can draw on a large amount of unlabeled images within the domain. At inference time, the trained student image encoder and the pre-stored text features (class vectors) are used.

The authors state that, to their knowledge, they are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP and (2) establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student. Experiments on 11 datasets are reported to demonstrate the effectiveness of the method.
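
The logit alignment step can be sketched as a KL divergence between the teacher's and student's softmax distributions. This is a minimal illustration under stated assumptions, not the paper's loss as implemented: the direction KL(teacher || student), the `temperature` parameter, and the function names are assumptions, since the metadata only states that logits are aligned via KL divergence.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions.

    The direction and the temperature are assumptions made for this sketch.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits for one unlabeled image over three classes.
teacher = [2.0, 0.5, -1.0]
loss_matched = kl_distillation_loss(teacher, [2.0, 0.5, -1.0])
loss_mismatched = kl_distillation_loss(teacher, [0.0, 1.5, 0.5])
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise. No ground-truth label appears anywhere in the computation, which is why this stage can run on unlabeled images.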
Created on 05 Apr. 2024
