
SORA AI – New Text-to-Video Generator

A few days ago, OpenAI released SORA, a text-to-video generation tool capable of generating HD videos from a simple text prompt. The videos it generates are breathtakingly detailed and realistic. In this article, we will dive deep into the working, applications, and limitations of SORA.

What is SORA AI?

SORA AI is a state-of-the-art model designed to generate short videos from text prompts. It has not been released to the public yet and is available only to select people for experimentation and risk evaluation. However, the OpenAI team has released sample video outputs and a brief technical report about SORA's architecture on its website. In Japanese, Sora means “sky”, and the name symbolizes its “infinite creative possibilities.”



The model can perform the below tasks:

  1. Generating videos up to a minute long from text prompts while maintaining visual quality.
  2. Animating still images.
  3. Extending existing videos forward or backward in time.
  4. Editing existing videos based on text instructions (video-to-video editing).

As per OpenAI – “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”



Working of SORA AI

Let us understand the working details of SORA.

Dimensionality reduction – video compression network

The dimensionality reduction is based on the variational autoencoder (VAE) concept from images. The encoder part of a VAE takes images in a high-dimensional space and maps them to a lower-dimensional space called the latent space. The decoder part takes vectors in the latent space and decodes them to reconstruct the input image. Since videos are nothing but sequences of images, the same concept can be extended to videos.

The video compression network takes raw video as input and outputs a latent representation that is compressed temporally and spatially. Sora is trained on, and subsequently generates videos within, this compressed latent space. A corresponding decoder model maps generated latents back to pixel space.
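Below is a minimal sketch of the compression idea in PyTorch, assuming a toy 3D convolutional autoencoder. OpenAI has not published SORA's actual compression network, so all layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

# Toy video compression network (illustrative sizes, not SORA's real design).
class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        # 3D convolutions downsample time (T) and space (H, W) at once.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),               # T/2, H/2, W/2
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1), # T/4, H/4, W/4
        )
        # The decoder maps latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):            # video: (batch, 3, T, H, W)
        z = self.encoder(video)          # compressed latent, smaller in T, H, W
        return self.decoder(z), z

model = VideoAutoencoder()
video = torch.randn(1, 3, 16, 64, 64)   # 16 RGB frames of 64x64
recon, latent = model(video)
print(latent.shape)                      # torch.Size([1, 4, 4, 16, 16])
```

Note how the latent tensor is smaller than the input along the time axis as well as the spatial axes – this temporal compression is what distinguishes a video compression network from a plain image VAE.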

Spacetime latent patches

Transformers have text tokens; Sora has visual patches. An input video is divided into fixed-size cube regions (spanning space and time) called patches. The concept of visual patches is borrowed from the design of ViT (Vision Transformer) and allows Sora to handle images and videos in a manner analogous to how transformers process sequential data like text. Thus, the compressed latent representation is subsequently decomposed into spacetime patches.

Figure: Process of converting videos to lower dimensions and dividing them into patches.

Figure: Visual patches as described in ViT.
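To make the idea concrete, here is a hedged sketch of spacetime patchification. The 2×2×2 patch size is an assumption for illustration, not SORA's published value.

```python
import torch

def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Split a latent video (C, T, H, W) into a sequence of flattened patches."""
    c, t, h, w = latent.shape
    x = latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)        # group dimensions by patch index
    return x.reshape(-1, c * pt * ph * pw)    # (num_patches, patch_dim)

latent = torch.randn(4, 4, 16, 16)  # e.g. a latent from the compression network
tokens = to_spacetime_patches(latent)
print(tokens.shape)                 # torch.Size([128, 32]) -> 128 visual "tokens"
```

Each flattened patch plays the same role for the video transformer that a text token plays for a language model.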

Scaling transformers for video generation

This is the core of the SORA model. Sora is a diffusion transformer model: given input noisy patches conditioned on information like text prompts, it is trained to predict the original “clean” patches. Diffusion transformers scale effectively for video models – as training compute increases, the quality of the generated video also increases.
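The following is a minimal sketch of a diffusion-transformer training step. The dimensions are illustrative and a generic transformer encoder stands in for the (unpublished) denoiser; this shows the idea of predicting clean patches from noisy ones, not SORA's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; SORA's real sizes are not public.
patch_dim, text_dim, num_patches = 32, 64, 128

# A generic transformer encoder stands in for the denoiser.
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=patch_dim, nhead=4, batch_first=True),
    num_layers=2,
)
text_proj = nn.Linear(text_dim, patch_dim)

clean = torch.randn(1, num_patches, patch_dim)  # target "clean" latent patches
text = torch.randn(1, 1, text_dim)              # embedded text prompt
t = torch.rand(1, 1, 1)                         # random noise level in [0, 1]

noise = torch.randn_like(clean)
noisy = (1 - t) * clean + t * noise             # interpolate toward pure noise

# Condition on the prompt by prepending its projection to the patch sequence.
seq = torch.cat([text_proj(text), noisy], dim=1)
pred = denoiser(seq)[:, 1:]                     # predictions for patch tokens

loss = nn.functional.mse_loss(pred, clean)      # learn to recover clean patches
loss.backward()
```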

Diffusion process

The diffusion process works in two phases. During training, Gaussian noise is progressively added to clean latent patches until they are indistinguishable from random noise. During generation, the model starts from pure noise and, conditioned on the text prompt, iteratively removes noise step by step until clean patches emerge; the decoder then maps these back to pixel space.
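Below is a simplified sketch of the reverse (sampling) loop, reusing the `denoiser` and `text_proj` from the previous snippet. The linear schedule and Euler-style update are deliberate simplifications of real diffusion samplers.

```python
import torch

@torch.no_grad()
def sample(denoiser, text_proj, text, num_patches=128, patch_dim=32, steps=50):
    """Start from pure noise and iteratively denoise, conditioned on the text."""
    x = torch.randn(1, num_patches, patch_dim)
    for i in range(steps, 0, -1):
        seq = torch.cat([text_proj(text), x], dim=1)
        pred_clean = denoiser(seq)[:, 1:]   # model's guess at the clean patches
        # Move a fraction of the way toward the prediction (a simplified
        # Euler-style update; real samplers use carefully derived schedules).
        x = x + (pred_clean - x) / i
    return x                                # denoised latent patches

patches = sample(denoiser, text_proj, torch.randn(1, 1, 64))
print(patches.shape)                        # torch.Size([1, 128, 32])
```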

Language understanding

The SORA team trains a highly descriptive captioner model and then uses it to produce text captions for all videos in the training set. During inference, short user prompts are expanded into longer, detailed captions (OpenAI reports using GPT for this) that are sent to the video model.
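As an illustration, the sketch below shows what inference-time prompt expansion might look like. `expand_prompt` and the stand-in LLM are hypothetical, not a published OpenAI API.

```python
# `expand_prompt` is a hypothetical helper, not a real OpenAI API; the
# stand-in "LLM" below just echoes text so the sketch runs end to end.
def expand_prompt(llm, user_prompt: str) -> str:
    """Turn a short user prompt into a long, descriptive caption."""
    instruction = (
        "Rewrite the following video idea as a richly detailed caption, "
        "describing subjects, setting, lighting, and camera motion:\n"
        + user_prompt
    )
    return llm(instruction)  # the detailed caption is what the video model sees

dummy_llm = lambda text: text  # replace with a real language-model call
print(expand_prompt(dummy_llm, "a dog playing in the snow"))
```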

Limitations of SORA AI

The OpenAI team has maintained that SORA faces challenges in the below scenarios:

  1. Physics simulation: It may struggle to accurately simulate the physics of a complex scene, such as glass shattering.
  2. Cause and effect: It may not understand specific instances of cause and effect; for example, a person might take a bite out of a cookie, but afterward the cookie may show no bite mark.
  3. Spatial details: It may confuse spatial details of a prompt, such as mixing up left and right.
  4. Events over time: It may struggle with precise descriptions of events that unfold over time, like following a specific camera trajectory.

Applications of SORA AI

A text-to-video model like SORA can be applied for the following use cases:

  1. Filmmaking: Rapid prototyping, storyboarding, and pre-visualization of scenes.
  2. Marketing and advertising: Producing short promotional clips without a full production crew.
  3. Education: Creating illustrative videos for teaching and training material.
  4. Social media: Quickly generating short-form content and stock footage.

What are the alternatives to SORA AI?

Let us see some of the other text-to-video models available.

  1. CogVideo: The first open-source pre-trained transformer for text-to-video generation in the general domain. CogVideo builds upon a powerful text-to-image model (CogView2) and is known for high-frame-rate generation: compared to other text-to-video models, it can generate videos with more frames per second, resulting in smoother and more realistic motion.
  2. Nuwa: It employs a “diffusion over diffusion” method to train models and utilizes an autoregressive generation mechanism for infinite image and video synthesis from text inputs, enabling the generation of long, HD-quality videos.
  3. Gen-2 by Runway: This is a content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. The model provides fine-grained control over output characteristics and supports customization based on a few reference images.
  4. Google’s Lumiere: Google’s video generation model Lumiere uses a diffusion architecture called Space-Time U-Net (STUNet) that figures out where things are in a video (space) and how they simultaneously move and change (time).

What are the risks associated with SORA AI?

  1. Malicious content: Sora could be used to create convincing fake content that is hateful, biased, or harmful.
  2. Societal impact: Sora could be used to spread misinformation that targets the basic fabric of modern society, such as elections and economies.
  3. Deepfake videos: Sora further raises concerns about deepfake video threats, which are already an issue with generative AI technology.
  4. Privacy violations: It could be used to impersonate individuals for purposes not known to them.
  5. Identity theft: Sora could be exploited for various malicious purposes, including identity theft, impersonation, or creating fake accounts for fraudulent activities.

How can I access SORA AI?

Sora is not available to the public, and it is not yet open source. Access is limited: it has been granted only to a select team of red teamers to assess critical areas for harm and risk, as well as to a number of visual artists, designers, and filmmakers to gather feedback on how to make the model most helpful for creative professionals.

What does OpenAI Sora mean for the future?

The release of SORA has renewed interest in text-to-video generation technology. It has set a benchmark for competitors, large and small, of what can be achieved. In the short term, we believe many large-scale competitors like Google and Meta will upgrade their current text-to-video models to match or surpass the capabilities of SORA. It will also fuel the development of open-source text-to-video models.

With the advancement of such state-of-the-art technology, it will have an impact on content creation and editing work. Tools like SORA can be used to quickly prototype scenes, generate stock footage, and automate repetitive parts of video editing.

OpenAI’s Safety Measure for SORA AI Model

While the model is state-of-the-art and impressive, it raises concerns about transparency, accountability, and ethical considerations. OpenAI recognizes the possibility of misuse of such advanced technology and is taking the below steps to address these concerns:

  1. Red teaming: Domain experts in areas like misinformation, hateful content, and bias are adversarially testing the model.
  2. Detection tools: OpenAI is building classifiers that can tell when a video was generated by Sora and plans to include C2PA provenance metadata in generated videos.
  3. Existing safety methods: The text and image classifiers built for DALL·E 3, which reject prompts and frames that violate usage policies, will be reused for Sora.
  4. Stakeholder engagement: OpenAI is engaging policymakers, educators, and artists to understand concerns and identify positive use cases.

SORA AI – Frequently Asked Questions

What is SORA?

Sora is a text-to-video generation model by OpenAI that generates up to 1-minute-long videos from text prompts.

Is SORA accessible?

SORA is not yet open sourced; as of now, it is available only to select individuals for feedback. The OpenAI team has released a few videos showcasing the capabilities of the model on its website.

What is core technology that drives SORA?

Though the exact technical details have not been revealed, the OpenAI team maintains that it is based on diffusion transformer technology.

