Accessing Vision Foundation Models at ImageNet-level Costs

AI-generated keywords: Vision foundation models, training resources, inaccessible data, Proteus, knowledge distillation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors address challenges posed by vision foundation models requiring significant training resources and inaccessible training data
  • Proposed solution called Proteus aims to distill foundation models into smaller equivalents on ImageNet-1K without original training data access
  • Proteus introduces three levels of training objectives (token, patch, feature) to enhance knowledge transfer efficacy
  • With DINOv2-g/14 as the teacher, Proteus-L/14 is trained at ImageNet-level costs and matches DINOv2-L/14 across 15 benchmarks
  • Outperforms other vision foundation models like CLIP-L/14, OpenCLIP-L/14, and SynCLR-L/14
  • Training at ImageNet-level costs makes foundation model training more accessible to the broader research community

Authors: Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu

Abstract: Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).
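The three levels of training objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss definitions, so everything below is an assumption made for illustration: the token level is taken to be the class token, the patch level the individual patch tokens, and the feature level a mean-pooled embedding over all tokens. The use of mean squared error, the equal weights, and the function names (`mse`, `mean_pool`, `three_level_loss`) are also hypothetical. Plain Python lists keep the sketch dependency-free; a real implementation would use a tensor library and would likely need a projection when student and teacher dimensions differ.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_pool(tokens: Sequence[Sequence[float]]) -> Vector:
    """Average a sequence of token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def three_level_loss(
    student_cls: Vector,
    student_patches: Sequence[Vector],
    teacher_cls: Vector,
    teacher_patches: Sequence[Vector],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Dict[str, float]:
    """Hypothetical token/patch/feature distillation loss.

    Only teacher outputs are used as targets; no class labels appear.
    That is a choice of this sketch, not a detail taken from the paper.
    """
    if len(student_patches) != len(teacher_patches):
        raise ValueError("student and teacher need the same number of patches")

    # Token level (assumed): align the class tokens.
    token = mse(student_cls, teacher_cls)

    # Patch level (assumed): align each patch token, averaged over patches.
    patch = sum(
        mse(s, t) for s, t in zip(student_patches, teacher_patches)
    ) / len(student_patches)

    # Feature level (assumed): align mean-pooled embeddings of all tokens.
    feature = mse(
        mean_pool([student_cls, *student_patches]),
        mean_pool([teacher_cls, *teacher_patches]),
    )

    w_token, w_patch, w_feature = weights
    total = w_token * token + w_patch * patch + w_feature * feature
    return {"token": token, "patch": patch, "feature": feature, "total": total}


if __name__ == "__main__":
    # Toy 2-dimensional embeddings with two patches per image.
    losses = three_level_loss(
        student_cls=[1.0, 2.0],
        student_patches=[[0.0, 0.0], [2.0, 2.0]],
        teacher_cls=[1.0, 4.0],
        teacher_patches=[[0.0, 2.0], [2.0, 2.0]],
    )
    print(losses)
```

In a training loop, the teacher outputs would be computed once per ImageNet-1K batch with gradients disabled, and only the student would be updated on the combined loss. The relative weighting of the three terms is left as a free parameter here because the abstract does not state it.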

Submitted to arXiv on 15 Jul. 2024

Results of the summarizing process for the arXiv paper: 2407.10366v1

In their paper "Accessing Vision Foundation Models at ImageNet-level Costs," Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, and Yun Fu address two obstacles posed by vision foundation models such as CLIP and DINOv2: they demand tremendous training resources, and their training data is often inaccessible. These models owe their generalization ability to massive training data, but the same scale makes it hard to develop derivatives that could advance research in the field. To overcome this, the authors propose a simple and general solution called Proteus.

Proteus distills foundation models into smaller equivalents on ImageNet-1K without access to the original training data. It removes the designs in conventional knowledge distillation settings that lead to dataset bias and introduces three levels of training objectives (token, patch, and feature) to maximize the efficacy of knowledge transfer. This allows Proteus to be trained at ImageNet-level costs. With DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M). By lowering the cost of distillation, the approach makes training foundation models more accessible to the broader research community.
Created on 07 Sep. 2025
