VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

AI-generated keywords: VL-JEPA

AI-generated Key Points

  • VL-JEPA is a novel vision-language model operating on a Joint Embedding Predictive Architecture (JEPA) that predicts continuous embeddings of target texts in an abstract representation space.
  • The model focuses on task-relevant semantics while abstracting away surface-level linguistic variability, which improves efficiency and helps meet real-time response requirements.
  • VL-JEPA supports selective decoding, reducing the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding.
  • The model's embedding space facilitates open-vocabulary classification, text-to-video retrieval, and discriminative Visual Question Answering (VQA) without architectural modifications.
  • VL-JEPA combines the strengths of CLIP and VLMs by leveraging web-scale noisy image-text pairs for open-domain features and supporting conditional generation tasks with a readout text decoder.
  • Efforts to improve efficiency in Vision-Language Models include updating only a subset of parameters during training, exploring methods like parameter pruning or token reduction for inference efficiency, and using small VLMs and heuristics for real-time applications.
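
The selective decoding idea in the key points can be illustrated with a minimal sketch. The key points do not describe the paper's actual decoding criterion, so the rule used here (decode only when the cosine similarity to the last decoded embedding falls below a threshold) is an assumption, and the names `selective_decode`, `decode`, and `threshold` are hypothetical. The toy two-dimensional vectors stand in for the model's predicted embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def selective_decode(embeddings, decode, threshold=0.9):
    """Invoke `decode` only when the embedding drifts from the last decoded one.

    Assumed rule (not taken from the paper): decode at the first step, then
    again whenever cosine similarity to the last decoded embedding < threshold.
    Returns a list of (step, decoded_text) pairs.
    """
    outputs = []
    last = None
    for step, emb in enumerate(embeddings):
        if last is None or cosine(emb, last) < threshold:
            outputs.append((step, decode(emb)))
            last = emb
    return outputs

# Toy stream of predicted embeddings: three similar frames, then a scene change.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]

# Stand-in for the lightweight text decoder (hypothetical).
def toy_decoder(emb):
    return "a" if emb[0] > emb[1] else "b"

calls = selective_decode(stream, toy_decoder)
print(calls)                      # decoder invocations as (step, text)
print(len(stream) / len(calls))   # reduction factor vs. decoding every step
```

In this toy stream the decoder runs twice instead of five times. The 2.85x figure reported for VL-JEPA comes from the paper's own experiments, not from this sketch.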

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

License: CC BY 4.0

Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.
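
To make the open-vocabulary classification claim concrete, here is a rough sketch of how a shared embedding space can be used for classification: the predicted embedding is compared with the embedding of each candidate label, and the closest label wins. The function `classify`, the label names, and the hand-made three-dimensional vectors are all hypothetical; in practice the vectors would come from the model's predictor and text encoder, and the abstract does not specify the similarity measure, so cosine similarity is an assumption.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def classify(predicted, label_embeddings):
    """Return the label whose embedding is most similar to `predicted`.

    Similarity is cosine similarity (an assumption for this sketch).
    Also returns the per-label scores for inspection.
    """
    p = normalize(predicted)
    scores = {
        label: sum(a * b for a, b in zip(p, normalize(emb)))
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hand-made stand-ins for text embeddings of candidate class names.
labels = {
    "cooking": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.2, 0.9],
    "running": [0.1, 0.9, 0.1],
}

# Hand-made stand-in for the embedding predicted from a video.
best, scores = classify([0.2, 0.1, 0.8], labels)
print(best)
```

Because the candidate labels are just text embedded at query time, the label set can change without retraining, which is what makes the classification open-vocabulary. Text-to-video retrieval follows the same pattern with the roles reversed: one text query is scored against many video embeddings.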

Submitted to arXiv on 11 Dec. 2025


Results of the summarizing process for the arXiv paper: 2512.10942v1

In this study, the authors introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Unlike traditional Vision-Language Models (VLMs), which generate tokens autoregressively, VL-JEPA predicts continuous embeddings of target texts in an abstract representation space. This lets the model focus on task-relevant semantics while abstracting away surface-level linguistic variability.

To improve efficiency and meet real-time response requirements, VL-JEPA invokes a lightweight text decoder only when its predicted embeddings need to be translated into text. The model also supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Additionally, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative Visual Question Answering (VQA) without any architectural modification.

VL-JEPA combines the strengths of CLIP and VLMs: it leverages web-scale noisy image-text pairs for strong open-domain features while supporting conditional generation tasks through a readout text decoder. Furthermore, learning in latent space makes VL-JEPA more efficient than generative VLMs that optimize directly in data space.

Prior efforts to improve efficiency in Vision-Language Models include updating only a subset of parameters during training and using methods such as parameter pruning or token reduction to speed up inference. Real-time applications rely on small VLMs and on heuristics that reduce query frequency during asynchronous inference.

Overall, VL-JEPA represents an advance in vision-language modeling by addressing cost-effectiveness, real-time response requirements, and task coverage across applications such as captioning, retrieval, visual question answering, action tracking, reasoning, and planning.
Created on 29 Dec. 2025

