
SORA AI – New Text-to-Video Generator

A few days ago, OpenAI released SORA, a text-to-video generation tool capable of generating HD videos from a simple text prompt. The videos it generates are breathtakingly detailed and realistic. In this article, we will dive deep into the working, applications, and limitations of SORA.

What is SORA AI?

SORA AI is a state-of-the-art model designed to generate short videos from text prompts. It has not been released to the public yet and is available only to select people for experimentation and risk evaluation. However, the OpenAI team has released sample video outputs and a brief technical report about SORA's architecture on its website. In Japanese, Sora means “sky”, and the name symbolizes its “infinite creative possibilities.”



The model can perform the below tasks:

  1. Generating videos up to a minute long from text prompts while maintaining visual quality.
  2. Animating still images.
  3. Extending existing videos forward or backward in time.
  4. Editing existing videos based on text instructions (video-to-video editing).

As per OpenAI – “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”



Working of SORA AI

Let us understand the working details of SORA.

Dimensionality reduction – video compression network

The dimensionality reduction is based on the variational autoencoder (VAE) concept from images. The encoder part of a VAE takes images in a high-dimensional space and maps them to a lower-dimensional space called the latent space. The decoder part takes vectors in the latent space and decodes them to reconstruct the input image. Since videos are nothing but sequences of images, the same concept can be extended to videos.

The video compression network takes raw video as input and outputs a latent representation that is compressed temporally and spatially. Sora is trained on, and subsequently generates videos within, this compressed latent space. A corresponding decoder model maps generated latents back to pixel space.
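Below is a minimal sketch of the compression idea in PyTorch, assuming a toy 3D convolutional autoencoder. OpenAI has not published SORA's actual compression network, so all layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

# Toy video compression network (illustrative sizes, not SORA's real design).
class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        # 3D convolutions downsample time (T) and space (H, W) at once.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),               # T/2, H/2, W/2
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1), # T/4, H/4, W/4
        )
        # The decoder maps latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):            # video: (batch, 3, T, H, W)
        z = self.encoder(video)          # compressed latent, smaller in T, H, W
        return self.decoder(z), z

model = VideoAutoencoder()
video = torch.randn(1, 3, 16, 64, 64)   # 16 RGB frames of 64x64
recon, latent = model(video)
print(latent.shape)                      # torch.Size([1, 4, 4, 16, 16])
```

Note how the latent tensor is smaller than the input along the time axis as well as the spatial axes – this temporal compression is what distinguishes a video compression network from a plain image VAE.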

Spacetime latent patches

Transformers have text tokens; Sora has visual patches. An input video is divided into fixed-size cube regions (spanning space and time) called patches. The concept of visual patches is borrowed from the design of ViT (Vision Transformer) and allows Sora to handle images and videos in a manner analogous to how transformers process sequential data like text. Thus, the compressed latent representation is subsequently decomposed into spacetime patches.

Figure: Process of converting videos to lower dimensions and dividing them into patches.

Figure: Visual patches as described in ViT.
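To make the idea concrete, here is a hedged sketch of spacetime patchification. The 2×2×2 patch size is an assumption for illustration, not SORA's published value.

```python
import torch

def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Split a latent video (C, T, H, W) into a sequence of flattened patches."""
    c, t, h, w = latent.shape
    x = latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)        # group dimensions by patch index
    return x.reshape(-1, c * pt * ph * pw)    # (num_patches, patch_dim)

latent = torch.randn(4, 4, 16, 16)  # e.g. a latent from the compression network
tokens = to_spacetime_patches(latent)
print(tokens.shape)                 # torch.Size([128, 32]) -> 128 visual "tokens"
```

Each flattened patch plays the same role for the video transformer that a text token plays for a language model.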

Scaling transformers for video generation

This is the core of the SORA model. Sora is a diffusion transformer model: given input noisy patches conditioned on information like text prompts, it is trained to predict the original “clean” patches. Diffusion transformers scale effectively for video models – as training compute increases, the quality of the generated video also increases.
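The following is a minimal sketch of a diffusion-transformer training step. The dimensions are illustrative and a generic transformer encoder stands in for the (unpublished) denoiser; this shows the idea of predicting clean patches from noisy ones, not SORA's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; SORA's real sizes are not public.
patch_dim, text_dim, num_patches = 32, 64, 128

# A generic transformer encoder stands in for the denoiser.
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=patch_dim, nhead=4, batch_first=True),
    num_layers=2,
)
text_proj = nn.Linear(text_dim, patch_dim)

clean = torch.randn(1, num_patches, patch_dim)  # target "clean" latent patches
text = torch.randn(1, 1, text_dim)              # embedded text prompt
t = torch.rand(1, 1, 1)                         # random noise level in [0, 1]

noise = torch.randn_like(clean)
noisy = (1 - t) * clean + t * noise             # interpolate toward pure noise

# Condition on the prompt by prepending its projection to the patch sequence.
seq = torch.cat([text_proj(text), noisy], dim=1)
pred = denoiser(seq)[:, 1:]                     # predictions for patch tokens

loss = nn.functional.mse_loss(pred, clean)      # learn to recover clean patches
loss.backward()
```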

Diffusion process

The diffusion process works in two phases. During training, Gaussian noise is progressively added to clean latent patches until they are indistinguishable from random noise. During generation, the model starts from pure noise and, conditioned on the text prompt, iteratively removes noise step by step until clean patches emerge; the decoder then maps these back to pixel space.
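Below is a simplified sketch of the reverse (sampling) loop, reusing the `denoiser` and `text_proj` from the previous snippet. The linear schedule and Euler-style update are deliberate simplifications of real diffusion samplers.

```python
import torch

@torch.no_grad()
def sample(denoiser, text_proj, text, num_patches=128, patch_dim=32, steps=50):
    """Start from pure noise and iteratively denoise, conditioned on the text."""
    x = torch.randn(1, num_patches, patch_dim)
    for i in range(steps, 0, -1):
        seq = torch.cat([text_proj(text), x], dim=1)
        pred_clean = denoiser(seq)[:, 1:]   # model's guess at the clean patches
        # Move a fraction of the way toward the prediction (a simplified
        # Euler-style update; real samplers use carefully derived schedules).
        x = x + (pred_clean - x) / i
    return x                                # denoised latent patches

patches = sample(denoiser, text_proj, torch.randn(1, 1, 64))
print(patches.shape)                        # torch.Size([1, 128, 32])
```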

Language understanding

The SORA team trains a highly descriptive captioner model and then uses it to produce text captions for all videos in the training set. During inference, short user prompts are expanded into longer, detailed captions (OpenAI reports using GPT for this) that are sent to the video model.
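As an illustration, the sketch below shows what inference-time prompt expansion might look like. `expand_prompt` and the stand-in LLM are hypothetical, not a published OpenAI API.

```python
# `expand_prompt` is a hypothetical helper, not a real OpenAI API; the
# stand-in "LLM" below just echoes text so the sketch runs end to end.
def expand_prompt(llm, user_prompt: str) -> str:
    """Turn a short user prompt into a long, descriptive caption."""
    instruction = (
        "Rewrite the following video idea as a richly detailed caption, "
        "describing subjects, setting, lighting, and camera motion:\n"
        + user_prompt
    )
    return llm(instruction)  # the detailed caption is what the video model sees

dummy_llm = lambda text: text  # replace with a real language-model call
print(expand_prompt(dummy_llm, "a dog playing in the snow"))
```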

Limitations of SORA AI

The OpenAI team has maintained that SORA faces challenges in the below scenarios:

  1. Physics simulation: It may struggle to accurately simulate the physics of a complex scene, such as glass shattering.
  2. Cause and effect: It may not understand specific instances of cause and effect; for example, a person might take a bite out of a cookie, but afterward the cookie may show no bite mark.
  3. Spatial details: It may confuse spatial details of a prompt, such as mixing up left and right.
  4. Events over time: It may struggle with precise descriptions of events that unfold over time, like following a specific camera trajectory.

Applications of SORA AI

A text-to-video model like SORA can be applied for the following use cases:

  1. Filmmaking: Rapid prototyping, storyboarding, and pre-visualization of scenes.
  2. Marketing and advertising: Producing short promotional clips without a full production crew.
  3. Education: Creating illustrative videos for teaching and training material.
  4. Social media: Quickly generating short-form content and stock footage.

What are the alternatives to SORA AI?

Let us see some of the other text-to-video models available.

  1. CogVideo: The first open-source pre-trained transformer for text-to-video generation in the general domain. CogVideo builds upon a powerful text-to-image model (CogView2) and is known for high-frame-rate generation: compared to other text-to-video models, it can generate videos with more frames per second, resulting in smoother and more realistic motion.
  2. Nuwa: It employs a “diffusion over diffusion” method to train models and utilizes an autoregressive generation mechanism for infinite image and video synthesis from text inputs, enabling the generation of long, HD-quality videos.
  3. Gen-2 by Runway: This is a content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. The model provides fine-grained control over output characteristics and supports customization based on a few reference images.
  4. Google’s Lumiere: Google’s video generation model Lumiere uses a diffusion architecture called Space-Time U-Net (STUNet) that figures out where things are in a video (space) and how they simultaneously move and change (time).

What are the risks associated with SORA AI?

  1. Malicious content: Sora could be used to create convincing fake content that is hateful, biased, or harmful.
  2. Societal impact: Sora could be used to spread misinformation that targets the basic fabric of modern society, such as elections and economies.
  3. Deepfake videos: Sora further raises concerns about deepfake video threats, which are already an issue with generative AI technology.
  4. Privacy violations: It could be used to impersonate individuals for purposes not known to them.
  5. Identity theft: Sora could be exploited for various malicious purposes, including identity theft, impersonation, or creating fake accounts for fraudulent activities.

How can I access SORA AI?

Sora is not available to the public, and it is not yet open source. Access is limited: it has been granted only to a select team of red teamers to assess critical areas for harm and risk, as well as to a number of visual artists, designers, and filmmakers to gather feedback on how to make the model most helpful for creative professionals.

What does OpenAI Sora mean for the future?

The release of SORA has renewed interest in text-to-video generation technology. It has set a benchmark for competitors, large and small, of what can be achieved. In the short term, we believe many large-scale competitors like Google and Meta will upgrade their current text-to-video models to match or surpass the capabilities of SORA. It will also fuel the development of open-source text-to-video models.

With the advancement of such state-of-the-art technology, it will have an impact on content creation and editing work. Tools like SORA can be used to quickly prototype scenes, generate stock footage, and automate repetitive parts of video editing.

OpenAI’s Safety Measure for SORA AI Model

While the model is state-of-the-art and impressive, it raises concerns about transparency, accountability, and ethical considerations. OpenAI recognizes the possibility of misuse of such advanced technology and is taking the below steps to address these concerns:

  1. Red teaming: Domain experts in areas like misinformation, hateful content, and bias are adversarially testing the model.
  2. Detection tools: OpenAI is building classifiers that can tell when a video was generated by Sora and plans to include C2PA provenance metadata in generated videos.
  3. Existing safety methods: The text and image classifiers built for DALL·E 3, which reject prompts and frames that violate usage policies, will be reused for Sora.
  4. Stakeholder engagement: OpenAI is engaging policymakers, educators, and artists to understand concerns and identify positive use cases.

SORA AI – Frequently Asked Questions

What is SORA?

Sora is a text-to-video generation model by OpenAI that generates up to 1-minute-long videos from text prompts.

Is SORA accessible?

SORA is not yet open sourced; as of now, it is available only to select individuals for feedback. The OpenAI team has released a few videos showcasing the capabilities of the model on its website.

What is core technology that drives SORA?

Though the exact technical details have not been revealed, the OpenAI team maintains that it is based on diffusion transformer technology.

