
What is Meta’s new V-JEPA model? [Explained]

Last Updated : 23 Feb, 2024

Meta, formerly known as Facebook, is a multinational technology company that focuses on social media, consumer technology, and AI research. It has developed several AI models that explore advanced machine learning, including the V-JEPA model and the I-JEPA model.

The V-JEPA model is released under a Creative Commons non-commercial license, which reflects Meta’s commitment to open science and to the development of advanced AI. This article explains what Meta’s new V-JEPA model is and how it compares with the I-JEPA model.

What is the V-JEPA model?

The V-JEPA model is a vision model trained exclusively with a feature prediction objective. It is a step toward advanced machine intelligence that learns more like humans do. V-JEPA learns directly from videos without any external supervision.

The Video Joint Embedding Predictive Architecture (V-JEPA) model embodies Yann LeCun’s vision of machines that build an internal conceptual model of the world. It could also influence how future AI models are built. Meta has said that it released the V-JEPA model under a Creative Commons noncommercial license so that researchers can experiment with the model and expand its capabilities.

Features of the V-JEPA Model

The Video Joint Embedding Predictive Architecture model (V-JEPA model) offers several key features and capabilities that set it apart from traditional video analysis methods.

  • The V-JEPA model learns from unlabeled data, which reduces the time and resources required for training. It uses self-supervised learning to comprehend and anticipate video content without the need for precisely labeled datasets.
  • Because it relies on self-supervised learning, the V-JEPA model is adaptive and versatile across tasks, and no labeled data has to be provided during the pre-training stage.
  • In contrast to image reconstruction and pixel-level prediction, the V-JEPA model predicts video features. This approach results in more efficient training and better performance on downstream tasks.
  • The V-JEPA model produces visual representations that handle both motion-based and appearance-based tasks, and it captures complex occurrences within video data.
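
The difference between predicting features and reconstructing pixels can be illustrated with a toy sketch. This is not Meta’s implementation: the linear `encode` function, the mean-based predictor, the L1 distance, and all sizes are assumptions chosen to keep the example small. The point is only that the loss compares short feature vectors of the masked patches, never their raw pixels.

```python
import random

def encode(patch, weights):
    """Toy linear encoder: map a flat list of pixels to a short feature vector."""
    return [sum(w * p for w, p in zip(row, patch)) for row in weights]

def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def feature_prediction_loss(video, masked, weights):
    """Predict features of masked patches from the visible ones and score the
    prediction in feature space (no pixels are reconstructed)."""
    visible_feats = [encode(p, weights) for i, p in enumerate(video) if i not in masked]
    dim = len(weights)
    # Hypothetical predictor: the mean of the visible features.
    prediction = [sum(f[k] for f in visible_feats) / len(visible_feats) for k in range(dim)]
    targets = [encode(video[i], weights) for i in sorted(masked)]
    return sum(l1_distance(prediction, t) for t in targets) / len(targets)

rng = random.Random(0)
PIXELS, FEATURES = 16, 4  # illustrative sizes, not taken from the model
weights = [[rng.uniform(-1, 1) for _ in range(PIXELS)] for _ in range(FEATURES)]
video = [[rng.random() for _ in range(PIXELS)] for _ in range(6)]  # six flat patches
loss = feature_prediction_loss(video, masked={2, 3, 4}, weights=weights)
print(f"feature-space loss: {loss:.4f}")
```

In a real system the encoder and predictor would be trained networks; here they are fixed so that the example stays self-contained.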

 Advancements and Applications of the V-JEPA Model 

  • The V-JEPA model could support automated video production by understanding and predicting narrative structures and visual elements, easing the production process.
  • The V-JEPA model can be employed in educational settings to produce interactive learning materials, annotate educational videos for content summarization, and provide personalized learning experiences based on student involvement and comprehension.
  • The V-JEPA model can analyze complex scenes, which makes it well suited to surveillance systems that detect suspicious activities or anomalies without human supervision.
  • The V-JEPA model could learn from training videos and provide real-time guidance during medical procedures, improving both education and patient care.
  • In the entertainment industry, the V-JEPA model could enable more immersive and interactive experiences, for example video games in which AI characters learn and adapt to the player’s actions.
  • The flexibility and efficiency of the V-JEPA model make it a useful tool for researchers who analyze complex video datasets across different scientific areas, from environmental studies to behavioral science.

What is the I-JEPA model?

The I-JEPA (Image Joint Embedding Predictive Architecture) model, based on the vision of Yann LeCun, Meta’s Chief AI Scientist, is presented by Meta as a major step forward in AI. The goal of this approach is to give machines the ability to create internal models of how the world works, so that they can perform complex tasks and adapt to unfamiliar situations more efficiently.

It is trained using self-supervised learning and learns competitive off-the-shelf image representations without extra knowledge encoded through hand-crafted image transformations. The I-JEPA paper was presented at CVPR 2023, and the training code and model checkpoints are open-sourced, paving the way for further exploration and collaboration in the AI community.

Features of the I-JEPA model

The I-JEPA (Image-based Joint-Embedding Predictive Architecture) model by Meta has a number of important characteristics for self-supervised learning from images.

  • It fills in missing information by predicting abstract targets rather than pixels, and as a result the model learns more semantic features.
  • Compared with many widely used computer vision models, the I-JEPA model is more computationally efficient and needs fewer computing resources during training.
  • The model performs strongly on computer vision tasks such as classification, object counting, and depth prediction, which shows that it can be used effectively across different tasks.
  • Meta has released the training code and the model checkpoints of the I-JEPA model, enabling researchers to dig deeper into the work and collaborate to explore it further.
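
The idea of predicting missing parts of an image from the rest can be sketched as a block-sampling step on a grid of image patches. This is a minimal illustration under stated assumptions, not the released training code: the 14 x 14 grid, the four target blocks, the scale ranges, and the helper names `sample_block` and `sample_context_and_targets` are all hypothetical choices for this example.

```python
import random

def sample_block(grid, scale_range, rng):
    """Sample a rectangular block of (row, col) patch positions that covers
    roughly `scale` of a grid x grid image."""
    scale = rng.uniform(*scale_range)
    area = max(1, round(scale * grid * grid))
    height = max(1, min(grid, round(area ** 0.5)))
    width = max(1, min(grid, round(area / height)))
    top = rng.randint(0, grid - height)
    left = rng.randint(0, grid - width)
    return {(r, c) for r in range(top, top + height) for c in range(left, left + width)}

def sample_context_and_targets(grid=14, num_targets=4, seed=0):
    """Return one context block and several target blocks from the same image.
    Target patches are removed from the context so they must be predicted."""
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), rng) for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), rng)
    for target in targets:
        context -= target  # the model never sees what it has to predict
    return context, targets

context, targets = sample_context_and_targets()
print("visible context patches:", len(context))
print("target block sizes:", [len(t) for t in targets])
```

The encoder would see only the `context` patches, and a predictor would be asked for the representation of each block in `targets`.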

Advancements and Applications of the I-JEPA Model 

  • The I-JEPA model shows that such architectures can learn competitive, meaningful off-the-shelf image representations without extra prior knowledge encoded in hand-crafted image transformations, which makes capturing semantic features from images efficient and scalable.
  • The I-JEPA model surpasses pixel-reconstruction techniques in ImageNet-1K linear probing and in low-level vision tasks, including object counting and depth prediction.
  • The capabilities and applications of the I-JEPA model are consistent with Meta’s pursuit of more human-like AI and its commitment to responsible open science.

Comparison Chart: V-JEPA model and I-JEPA model


|  | V-JEPA model | I-JEPA model |
|---|---|---|
| Learning Approach | Learns by filling in the missing or masked parts of a video in an abstract representation space, using a self-supervised learning method. | Uses a self-supervised learning strategy in which the representations of several target blocks are predicted from a single context block within the same image. |
| Model Type | Non-generative model for video learning | Image-based Joint-Embedding Predictive Architecture |
| Mask Methodology | Masks out a significant part of a video so that only a very small portion of the context is visible, then asks the predictor to fill in the blanks in a dense vector representation space. | Predicts high-level abstractions and significant features from images, with a focus on capturing and predicting high-level information rather than pixel-level details. |
| Computational Efficiency | Requires fewer labeled samples and less effort to make use of unlabeled data. | Saves significant computing resources during training; useful for applications that previously required a lot of manually labeled data. |
|  | Outperforms previous video representation learning methods in frozen evaluation on image classification, action classification, and spatio-temporal action detection tasks. | Outperforms pixel-reconstruction methods in ImageNet-1K linear probing and low-level vision tasks such as object counting and depth prediction. |
|  | Is able to discard unpredictable information to improve training and sample efficiency. | Predicts representations of different target positions in the same image, which improves the semantic level of the self-supervised representations without relying on extra knowledge encoded in image transformations. |
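
Masking out a significant part of a video, as described in the comparison above, can be sketched as follows. This is one plausible scheme under stated assumptions, not the model’s actual masking code: the same spatial block is hidden in every frame (a "tube"), and the grid size, block size, and the function name `tube_mask` are hypothetical.

```python
import random

def tube_mask(frames, height, width, block_h, block_w, seed=0):
    """Hide the same block_h x block_w spatial region in every frame.
    Returns the set of masked (frame, row, col) patch positions."""
    rng = random.Random(seed)
    top = rng.randint(0, height - block_h)
    left = rng.randint(0, width - block_w)
    return {
        (t, r, c)
        for t in range(frames)
        for r in range(top, top + block_h)
        for c in range(left, left + block_w)
    }

frames, height, width = 8, 14, 14  # illustrative patch grid for a short clip
masked = tube_mask(frames, height, width, block_h=13, block_w=13)
total = frames * height * width
masked_ratio = len(masked) / total  # 169 / 196 of every frame is hidden
visible = total - len(masked)
print(f"masked {masked_ratio:.0%} of patches, {visible} remain visible")
```

Hiding the same region across all frames means the predictor cannot simply copy a masked patch from a neighboring frame; it has to infer what is there from the small visible context.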

Which Meta Model is Better: V-JEPA model or I-JEPA Model?

The V-JEPA model and the I-JEPA model both have notable strengths and represent major improvements in their respective fields of application.

Talking about the V-JEPA Model:

  • Trained on a dataset of 2 million videos obtained from public datasets.
  • Designed to predict features from video data alone, without depending on any external supervision.
  • Has shown remarkable performance across multiple downstream image and video tasks in frozen evaluation, with consistent improvement observed in all tasks and especially in tasks that require understanding motion.
  • Makes training 1.5 to 6 times more efficient, and can be trained entirely with unlabeled data.
  • Offers potential for future use, especially in embodied AI and contextual AI assistants for future AR glasses.
  • Was released under a Creative Commons license, which promotes collaboration and further extensions of this work in the AI research community.
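
Frozen evaluation, mentioned in the list above, means that the pretrained encoder is left untouched and only a small classifier is trained on top of its output features. The sketch below illustrates the idea with a toy setup; the `frozen_encoder` stand-in, the tiny dataset, and the hyperparameters are all assumptions made for this example and have nothing to do with the real model’s weights.

```python
import math

def frozen_encoder(x):
    """Stand-in for a pretrained encoder. It has no trainable parameters here,
    which mirrors the fact that its weights stay fixed during evaluation."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(samples, labels, epochs=200, lr=0.5):
    """Train only a linear classifier (logistic regression) on frozen features."""
    feats = [frozen_encoder(x) for x in samples]  # computed once, never updated
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b

def predict(x, w, b):
    f = frozen_encoder(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Hypothetical two-class toy data with a clear gap between the classes.
samples = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.3), (0.0, 0.4),
           (0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (1.0, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_linear_probe(samples, labels)
accuracy = sum(predict(x, w, b) == y for x, y in zip(samples, labels)) / len(samples)
print(f"linear probe accuracy: {accuracy:.2f}")
```

Because only `w` and `b` are learned, the score reflects how useful the frozen features already are, which is why this protocol is used to compare pretrained representations.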

Talking about the I-JEPA Model:

  • Uses a self-supervised learning method, predicting representations of different target blocks within the same image from a single context block.
  • Outperforms pixel-reconstruction methods for the ImageNet-1K linear probing and low-level vision tasks of object counting and depth prediction.
  • Learns semantically meaningful image features without requiring hand-crafted view augmentations, demonstrating efficiency and scalability in learning semantic features from images.
  • Meta has open-sourced the training code and model checkpoints of I-JEPA, giving the researchers an opportunity to explore and collaborate even further on this breakthrough work in AI research.


Both the V-JEPA model and the I-JEPA model of Meta are clear advances in the field of artificial intelligence, especially in video and image understanding via self-supervised learning.

The V-JEPA model learns efficiently from videos without supervision, which lets it achieve impressive performance in downstream tasks while offering gains in training efficiency. In contrast, the I-JEPA model excels in image-based tasks, using self-supervised learning to predict image representations more efficiently.

FAQs – V-JEPA Model

What is a meta model in machine learning?

A meta model in machine learning is a model trained across many different tasks with the goal of acquiring generalized knowledge that can be transferred to a new task.

What is JEPA?

JEPA (Joint Embedding Predictive Architecture) is an architecture that predicts the representations of various target blocks inside an image from a single context block.

What is a meta human like AI image creation model?

Meta’s model aimed at more human-like AI for images is the I-JEPA model. It learns image representations rather than generating new images.

How do you explain the deep learning model?

Deep learning models can perform various tasks, including recognizing complex patterns in sounds, images, and text. They are used to produce data insights and predictions.

What is the concept of energy model?

Energy models are computer models of energy systems that are built in order to analyze those systems. They are often used for scenario analysis, in which the assumptions about technical and economic conditions are varied to see how the outcome changes.
