VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

AI-generated keywords: VL-JEPA

AI-generated Key Points

- VL-JEPA is a vision-language model built on a Joint Embedding Predictive Architecture (JEPA) that predicts continuous embeddings of target texts in an abstract representation space.
- The model focuses on task-relevant semantics while abstracting away surface-level linguistic variability, which improves efficiency and helps meet real-time response requirements.
- VL-JEPA supports selective decoding, reducing the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding.
- The model's embedding space supports open-vocabulary classification, text-to-video retrieval, and discriminative Visual Question Answering (VQA) without architectural modifications.
- VL-JEPA combines the strengths of CLIP and VLMs by leveraging web-scale noisy image-text pairs for open-domain features and supporting conditional generation tasks with a readout text decoder.
- Related efforts to improve efficiency in Vision-Language Models include updating only a subset of parameters during training, parameter pruning or token reduction for inference efficiency, and small VLMs combined with heuristics for real-time applications.

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

arXiv: 2512.10942v1 (cs.CV)

License: CC BY 4.0

Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.

Submitted to arXiv on 11 Dec. 2025

Results of the summarizing process for the arXiv paper: 2512.10942v1

Comprehensive Summary

In this study, the authors introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Unlike traditional Vision-Language Models (VLMs), which generate tokens autoregressively, VL-JEPA predicts continuous embeddings of target texts in an abstract representation space. This lets the model focus on task-relevant semantics while abstracting away surface-level linguistic variability.

To improve efficiency and meet real-time response requirements, VL-JEPA invokes a lightweight text decoder only when its predicted embeddings need to be translated into text. The model also supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Additionally, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative Visual Question Answering (VQA) without requiring any architectural modifications.

VL-JEPA combines the strengths of both CLIP and VLMs: it leverages web-scale noisy image-text pairs for strong open-domain features, and it supports conditional generation tasks through a readout text decoder. Furthermore, because it learns in latent space, VL-JEPA is more efficient than generative VLMs that optimize directly in data space.

Related efforts to improve efficiency in Vision-Language Models include updating only a subset of parameters during training and exploring methods such as parameter pruning or token reduction for inference efficiency. Real-time applications benefit from small VLMs and from heuristics that reduce query frequency during asynchronous inference.

Overall, VL-JEPA represents a significant advancement in vision-language modeling by addressing cost-effectiveness, real-time response requirements, and task coverage across applications such as captioning, retrieval, visual question answering, action tracking, reasoning, and planning.

Layman's Summary

- VL-JEPA is a new type of model that can understand both pictures and words together using a special architecture called JEPA.
- This model focuses on the important meaning of a task while ignoring unimportant details of wording, which makes it faster and able to respond quickly.
- VL-JEPA can choose when it actually needs to turn its internal predictions into text, making it more efficient without losing accuracy.
- The model's shared representation space allows it to classify videos, find videos related to a piece of text, and answer questions about images without changing its basic structure.
- VL-JEPA combines the strengths of two other kinds of models by using lots of image-text pairs from the internet to learn features and by being able to generate text based on conditions.

Definitions

1. Vision-Language Model (VLM): A type of computer program that can understand both images and words together.
2. Joint Embedding Predictive Architecture (JEPA): A special design that helps the model predict how different pieces of information are related in a shared space.
3. Semantics: The important meanings or ideas behind words or images, rather than just their surface details.
4. Decoding: Turning the model's internal representation into readable text.
5. Open-vocabulary classification: Sorting things into categories without having a limited list of options beforehand.

Blog article

Introduction

Vision-Language Models (VLMs) have gained significant attention in recent years due to their ability to process and generate text from visual inputs. These models have shown impressive performance in various tasks such as captioning, retrieval, visual question answering, and more. However, traditional VLMs face challenges related to efficiency and real-time response requirements. In this research paper, the authors introduce VL-JEPA, a novel vision-language model that operates on a Joint Embedding Predictive Architecture (JEPA). This approach addresses issues related to cost-effectiveness, real-time response requirements, and task coverage across various applications.

The Problem with Traditional VLMs

Traditional VLMs generate tokens sequentially, which can be time-consuming and computationally expensive. Training in token space also forces the model to account for surface-level linguistic variability rather than concentrating on task-relevant semantics. Additionally, these models typically require architectural modifications to handle tasks such as open-vocabulary classification or text-to-video retrieval. Moreover, traditional VLMs optimize directly in data space, which the authors argue is less efficient than optimizing in latent space. This can translate into longer training times and higher computational costs.
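
To make the point about surface-level variability concrete, here is a minimal sketch comparing a token-level view with an embedding-level view of the same sentences. The three-dimensional embeddings are hand-picked, hypothetical values used only for illustration; they do not come from VL-JEPA or from any real text encoder, and the helper names are assumptions of this example.

```python
import math

def token_overlap(a: str, b: str) -> float:
    """Fraction of aligned positions at which two token sequences agree exactly."""
    tokens_a, tokens_b = a.split(), b.split()
    longest = max(len(tokens_a), len(tokens_b))
    matches = sum(x == y for x, y in zip(tokens_a, tokens_b))
    return matches / longest

def cosine(u, v) -> float:
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration only.
TOY_EMBEDDINGS = {
    "the light is off": [0.90, 0.10, 0.00],
    "the lamp has been switched off": [0.85, 0.15, 0.05],  # paraphrase
    "the light is on": [-0.90, 0.10, 0.00],                # opposite meaning
}

reference = "the light is off"
for candidate in ("the lamp has been switched off", "the light is on"):
    overlap = token_overlap(reference, candidate)
    similarity = cosine(TOY_EMBEDDINGS[reference], TOY_EMBEDDINGS[candidate])
    print(f"{candidate!r}: token overlap={overlap:.2f}, cosine={similarity:.2f}")
```

In this toy setup the paraphrase shares few tokens with the reference but sits close to it in embedding space, while the sentence with the opposite meaning shares most tokens yet sits far away. A token-space objective has to deal with the first kind of difference; an embedding-space objective can largely ignore it.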

The Solution: VL-JEPA

VL-JEPA tackles these challenges by predicting continuous embeddings of target texts in an abstract representation space instead of generating tokens sequentially. This allows the model to focus on task-relevant semantics while abstracting away surface-level linguistic variability. To further improve efficiency and meet real-time response requirements, VL-JEPA invokes a lightweight text decoder only when its predicted embeddings need to be translated into text. The model also supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Additionally, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative Visual Question Answering (VQA) without requiring any architectural modifications.
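
The idea behind selective decoding can be sketched as follows. This is an illustrative assumption, not the paper's algorithm: the rule used here (decode only when the predicted embedding drifts below a cosine-similarity threshold relative to the last decoded embedding), the threshold value, and the `toy_decode` function are all hypothetical.

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def selective_decode(embeddings, decode, threshold=0.9):
    """Return one text per step, calling `decode` only when the embedding changes.

    Assumed rule: reuse the previous text while the current predicted embedding
    stays within `threshold` cosine similarity of the last decoded embedding.
    """
    outputs, calls = [], 0
    last_embedding, last_text = None, None
    for embedding in embeddings:
        if last_embedding is None or cosine(embedding, last_embedding) < threshold:
            last_text = decode(embedding)  # the expensive step
            last_embedding = embedding
            calls += 1
        outputs.append(last_text)
    return outputs, calls

def toy_decode(embedding):
    """Stand-in for a text decoder: labels the dominant dimension."""
    return "A" if embedding[0] > embedding[1] else "B"

# Hypothetical stream of predicted embeddings: the scene changes once, at step 3.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99], [0.1, 0.98]]
outputs, calls = selective_decode(stream, toy_decode)
print(outputs, f"decoder calls: {calls} of {len(stream)}")
```

Uniform decoding would call the decoder at every step of this stream. The sketch calls it only when the embedding moves, which is the kind of saving that operating in a continuous embedding space makes possible; the 2.85x figure reported in the paper comes from the authors' own method and benchmarks, not from this toy.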

Combining the Strengths of CLIP and VLMs

VL-JEPA combines the strengths of both CLIP and VLMs by leveraging web-scale noisy image-text pairs for strong open-domain features while supporting conditional generation tasks with a readout text decoder. This approach allows the model to perform well on various vision-language tasks without sacrificing efficiency.
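
The CLIP-like side of this combination can be illustrated with a small sketch: once a model predicts a vector in the same space as text embeddings, open-vocabulary classification and retrieval reduce to nearest-neighbour search by similarity. The label names and vectors below are hypothetical, and the use of cosine similarity is an assumption of this example rather than a detail confirmed by the summary.

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def rank_candidates(predicted, candidates):
    """Sort candidate names by similarity to the predicted embedding, best first."""
    return sorted(candidates, key=lambda name: cosine(predicted, candidates[name]), reverse=True)

def classify(predicted, candidates):
    """Open-vocabulary classification: pick the most similar candidate label."""
    return rank_candidates(predicted, candidates)[0]

# Hypothetical text embeddings for an arbitrary label set chosen at inference time.
label_embeddings = {
    "cooking": [0.1, 1.0, 0.0],
    "swimming": [1.0, 0.0, 0.1],
    "cycling": [0.0, 0.1, 1.0],
}

# Hypothetical embedding predicted by the model for one video.
predicted_embedding = [0.2, 0.9, 0.1]

print(classify(predicted_embedding, label_embeddings))
print(rank_candidates(predicted_embedding, label_embeddings))
```

The same ranking function covers discriminative VQA if the candidates are answer options instead of class labels, which is why no architectural change is needed to switch between these tasks.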

Efforts to Improve Efficiency in Vision-Language Models

The research paper also discusses efforts to improve efficiency in Vision-Language Models, such as updating only a subset of parameters during training or exploring methods like parameter pruning or token reduction for inference efficiency. Real-time applications can benefit from small VLMs and heuristics to reduce query frequency during asynchronous inference.
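
As a rough illustration of "updating only a subset of parameters", the sketch below applies a gradient step to the parameters marked trainable and leaves the rest frozen. Scalars stand in for parameter tensors, and the component names (`vision_encoder`, `predictor`) and the choice of which one is frozen are hypothetical; the summary does not say which parts of any particular model are updated.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Return updated parameters, changing only those whose name is in `trainable`."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

# Scalars stand in for full parameter tensors; names are illustrative only.
params = {"vision_encoder": 1.0, "predictor": 2.0}
grads = {"vision_encoder": 0.5, "predictor": 0.5}

updated = sgd_step(params, grads, trainable={"predictor"})
print(updated)
```

Frozen components need no gradient storage or optimizer state, which is where most of the training-time savings of this family of methods comes from.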

Conclusion

In conclusion, VL-JEPA represents a significant advancement in vision-language modeling by addressing issues related to cost-effectiveness, real-time response requirements, and task coverage across various applications. The proposed JEPA architecture allows the model to focus on task-relevant semantics while abstracting away surface-level linguistic variability. Furthermore, VL-JEPA's embedding space facilitates open-vocabulary classification and supports various vision-language tasks without requiring any architectural modifications. With further developments in this area, we can expect more efficient and effective vision-language models in the future.

Created on 29 Dec. 2025
