Using an OpenAI GPT prompt, I built a "PaperSummarizer"
and asked it to summarize the Make-A-Video paper.
『Make-A-Video: Text-to-Video Generation without Text-Video Data』https://arxiv.org/abs/2209.14792
The full summary result is as follows.
INTRODUCTION
- Collecting large numbers of (alt-text, image) pairs from HTML pages on the Internet has led to recent breakthroughs in text-to-image modeling.
- However, this success is hard to replicate for video, because (text, video) pairs cannot be collected as easily.
- We reasoned that it would be wasteful to train a text-to-video (T2V) model from scratch when models that can already generate images exist.
- In addition, unsupervised learning lets networks learn from far larger quantities of data.
- This scale of data is important for learning subtle and uncommon concepts of the world, and unsupervised learning has similarly helped advance natural language processing.
- We therefore propose Make-A-Video: it uses a T2I model to learn the correspondence between text and the visual world from paired (text, image) data, and learns realistic motion from video data through unsupervised learning, without any paired text-video supervision. Given input text, it generates video by extending generation along the time axis, learning world dynamics from large collections of unlabeled videos.
PREVIOUS WORK
- For text-to-image generation (T2I), Reed et al. (2016) extended GANs to condition on text, Zhang et al. (2017) proposed progressive generation, and Xu et al. (2018) improved text-image alignment; more recent proposals include Make-A-Scene (Gafni et al., 2022) and Parti (Yu et al., 2022a).
- Work on text-to-video generation (T2V) such as Mittal et al. (2017), Pan et al. (2017), Li et al. (2018), Gupta et al. (2018), and Liu et al. (2019b) has mainly targeted simple domains (e.g., moving digits or specific human actions). Sync-DRAW (Mittal et al., 2017) was the first T2V generation method, using a VAE with recurrent attention. GODIVA (Wu et al., 2021a), NÜWA (Wu et al., 2021b), and CogVideo (Hong et al., 2022) are more recent approaches.
- Although using image information to simplify video generation has been discussed before, Make-A-Video differs from prior work in that it does not rely on paired text-video data, and it surpasses VDM (Ho et al., 2022) by adapting the weights of a pre-trained T2I model and adding pseudo-3D convolution and temporal attention layers.
METHOD
- Make-A-Video consists of three main components: (i) a base T2I model (Sec. 3.1), (ii) spatiotemporal convolution and attention layers that extend the network's building blocks to the temporal dimension (Sec. 3.2), and (iii) a frame interpolation network for generating high frame rates, which is an essential element for T2V generation (Sec. 3.3). The final T2V inference scheme of Make-A-Video is as follows (Fig. 2): ŷ_t = SR_h ∘ SR_l^t ∘ ↑F ∘ D^t ∘ P ∘ (x̂, C_x(x)), where ŷ_t is the generated video, SR_h and SR_l^t are the spatial and spatiotemporal super-resolution networks (Sec. 3.2), ↑F is the frame interpolation network (Sec. 3.3), D^t is the spatiotemporal decoder (Sec. 3.2), P is the prior (Sec. 3.1), x̂ is the BPE-encoded text, C_x is the CLIP text encoder (Radford et al., 2021), and x is the input text. These three main parts are described in detail in the following sections.
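To make the composition above concrete, here is a minimal sketch of the inference pipeline in the document's notation. All callables (bpe_encode, clip_text_encoder, prior_P, decoder_Dt, interp_F, sr_tl, sr_h) are hypothetical placeholders, not part of any released code:

```python
import torch

def make_a_video(x: str, bpe_encode, clip_text_encoder,
                 prior_P, decoder_Dt, interp_F, sr_tl, sr_h) -> torch.Tensor:
    """Compose y_t = SR_h . SR_l^t . ↑F . D^t . P . (x̂, C_x(x))."""
    x_hat = bpe_encode(x)                    # BPE-encoded text tokens x̂
    x_e = clip_text_encoder(x)               # CLIP text embedding C_x(x)
    y_e = prior_P(x_hat, x_e)                # image embedding from the prior P
    frames_lowres = decoder_Dt(y_e)          # 16 frames at 64x64 (spatiotemporal decoder D^t)
    frames_highfps = interp_F(frames_lowres) # frame interpolation ↑F
    frames_sr = sr_tl(frames_highfps)        # spatiotemporal super-resolution to 256x256
    return sr_h(frames_sr)                   # per-frame spatial super-resolution to 768x768
```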
TEXT-TO-IMAGE MODEL
- Following prior work (Ramesh et al., 2022), we build a T2I model and train it as the backbone before adding the temporal layers.
- To generate high-resolution images from text, the following networks are used: (i) a prior network P that generates an image embedding y_e conditioned on the BPE-encoded text tokens x̂ and the text embedding x_e, (ii) a decoder network that generates a low-resolution (64 × 64 RGB) image ŷ_l conditioned on the image embedding y_e, and (iii) two super-resolution networks SR_l and SR_h that upsample the generated image to 256 × 256 and then 768 × 768 pixels, producing the final image ŷ.
SPATIOTEMPORAL LAYERS
- To extend the 2D conditional network to the temporal dimension, we modify the two building blocks that must handle both spatial and temporal information to generate video: (i) the convolutional layers (Sec. 3.2.1) and (ii) the attention layers (Sec. 3.2.2). Other layers, such as fully connected layers, need no special handling when the extra dimension is added, as they are agnostic to structured spatial and temporal information. These temporal modifications are applied in the majority of the U-Net-based diffusion networks.
- The spatiotemporal decoder D^t generates 16 RGB frames of size 64 × 64, the newly added frame interpolation network ↑F interpolates between the 16 generated frames to increase the effective frame rate (Fig. 2), and the spatiotemporal super-resolution network SR_l^t also operates across frames. Note that SR_h is hard to extend to the temporal dimension, partly because of memory and compute limitations, so it operates only spatially.
PSEUDO-3D CONVOLUTIONAL LAYERS
- Motivated by the separable convolutions of Chollet (2017), and as shown in Fig. 3, a 1D convolution is stacked after each 2D convolutional layer, which avoids the heavy computational cost of full 3D convolutions while still sharing information across the spatial and temporal axes. It also creates a clear separation between the pre-trained 2D convolutional layers and the newly initialized 1D convolutional layers, so the new temporal layers can be learned from scratch while the existing spatial knowledge is retained. For an input tensor h ∈ R^{B×C×F×H×W}, the pseudo-3D convolutional layer is defined as Conv_P3D(h) := Conv_1D(Conv_2D(h) ∘ T) ∘ T, where the transpose operator ∘T swaps the spatial and temporal dimensions. For smooth initialization, the Conv_2D layers are initialized from the pre-trained T2I model and the Conv_1D layers are initialized as the identity function, so that at initialization the network produces K different images (driven by random noise but each faithful to the input text), enabling a stable transition from purely spatial layers to spatiotemporal layers.
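Below is a minimal PyTorch sketch of such a pseudo-3D convolution. The class name, shapes, and initialization details are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Spatial conv: in the paper this would be initialized from pre-trained T2I weights.
        self.conv2d = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        # Temporal conv: initialized as the identity (Dirac kernel, zero bias).
        self.conv1d = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.conv1d.weight)
        nn.init.zeros_(self.conv1d.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, C, F, H, W)
        b, c, f, height, width = h.shape
        # Apply the 2D conv to every frame independently.
        h = self.conv2d(h.permute(0, 2, 1, 3, 4).reshape(b * f, c, height, width))
        # Transpose so the temporal axis becomes the length axis of the 1D conv.
        h = h.reshape(b, f, c, height, width).permute(0, 3, 4, 2, 1)   # (B, H, W, C, F)
        h = self.conv1d(h.reshape(b * height * width, c, f))
        # Restore (B, C, F, H, W).
        return h.reshape(b, height, width, c, f).permute(0, 3, 4, 1, 2)
```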
PSEUDO-3D ATTENTION LAYERS
- The attention layers are an important part of the T2I network: they inject text information into multiple network layers, along with other relevant information such as the diffusion time step.
- Full spatiotemporal attention is even more computationally and memory intensive than 3D convolution, so instead, following Ho et al. (2022), we extend the same dimension-decomposition strategy to the attention layers.
- After each (pre-trained) spatial attention layer we stack a newly added temporal attention layer that attends across frames; as with the convolutions, the spatial attention is initialized from the T2I model and the temporal attention is initialized as the identity function, approximating a full spatiotemporal attention layer.
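The decomposition can be sketched roughly as follows, using standard multi-head self-attention per frame (spatial) and per spatial location (temporal). The zero-initialized output projection that makes the temporal branch start as an identity is our assumption about one way to realize the identity initialization:

```python
import torch
import torch.nn as nn

class PseudoAttention3d(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads.
        # Spatial attention: would be loaded from the pre-trained T2I model.
        self.attn_2d = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Temporal attention: newly added; its output projection is zero-initialized
        # so that the block starts out as the identity.
        self.attn_1d = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_out = nn.Linear(channels, channels)
        nn.init.zeros_(self.temporal_out.weight)
        nn.init.zeros_(self.temporal_out.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, C, F, H, W)
        b, c, f, height, width = h.shape
        # Spatial attention: flatten each frame's pixels into a sequence.
        s = h.permute(0, 2, 3, 4, 1).reshape(b * f, height * width, c)
        s = s + self.attn_2d(s, s, s, need_weights=False)[0]
        # Temporal attention: attend across frames at each spatial location.
        t = s.reshape(b, f, height * width, c).permute(0, 2, 1, 3)
        t = t.reshape(b * height * width, f, c)
        t = t + self.temporal_out(self.attn_1d(t, t, t, need_weights=False)[0])
        # Restore (B, C, F, H, W).
        return t.reshape(b, height * width, f, c).permute(0, 3, 2, 1).reshape(b, c, f, height, width)
```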
FRAME INTERPOLATION NETWORK
- In addition to the spatiotemporal modifications discussed in Section 3.2, we train a masked frame interpolation and extrapolation network ↑F to increase the effective frame rate.
- To stay within memory and compute constraints, the spatiotemporal decoder D^t is fine-tuned on masked frame interpolation: the masked input frames are zero-padded, and the U-Net receives 4 extra input channels, the 3 RGB channels of the masked video plus a binary channel indicating which frames are masked (see the sketch after this list).
- ↑F is applied to the generated video tensor, extending it via masked frame interpolation.
- In all experiments, the 16 generated frames are upsampled to 76 frames ((16 − 1) × 5 + 1) with a frame skip of 5.
- The same architecture can also be used for video extrapolation or image animation by masking frames at the beginning or end of the video.
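As a rough illustration of the masked-interpolation input described above, the following sketch zero-pads the unknown frames and appends a binary mask channel. The helper name and the mask convention (1 = masked) are our assumptions:

```python
import torch

def build_interpolation_input(frames: torch.Tensor, skip: int = 5) -> torch.Tensor:
    """frames: (B, 3, F, H, W) low-FPS video. Returns (B, 4, (F-1)*skip+1, H, W)."""
    b, c, f, h, w = frames.shape
    f_out = (f - 1) * skip + 1              # e.g. 16 frames -> 76 frames at frame skip 5
    rgb = torch.zeros(b, 3, f_out, h, w)    # zero-padded RGB input
    mask = torch.ones(b, 1, f_out, h, w)    # 1 = frame is masked (to be generated)
    rgb[:, :, ::skip] = frames              # keep the known frames
    mask[:, :, ::skip] = 0.0                # known frames are not masked
    return torch.cat([rgb, mask], dim=1)    # 3 RGB channels + 1 binary mask channel
```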
TRAINING
- The various components of Make-A-Video are trained independently.
- The only component that receives text as input is the prior P. It is trained on paired (text, image) data and is not fine-tuned on video.
- The decoder, the prior, and the two super-resolution components are first trained on images alone (without aligned text).
- The temporal layers are then added, initialized, and fine-tuned on unlabeled video data.
- 16 frames are sampled from each original video at a random frame rate between 1 and 30 fps. The frame rate is sampled with a beta function: when training the decoder we start from the higher-FPS range (less motion) and transition to the lower-FPS range (more motion); an illustrative sketch follows this list.
- The masked frame interpolation component is fine-tuned from the temporal decoder.
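As an illustration of the FPS sampling schedule described above, the following sketch draws the frame rate from a beta distribution that is annealed from the high-FPS range toward the low-FPS range as training progresses. The exact parameters and schedule are not given in the summary and are assumptions:

```python
import numpy as np

def sample_fps(progress: float, fps_min: int = 1, fps_max: int = 30) -> int:
    """progress in [0, 1]: fraction of decoder fine-tuning completed."""
    # Early in training the distribution is skewed toward 1.0 (high FPS, little motion);
    # later it shifts toward 0.0 (low FPS, more motion between sampled frames).
    a = 1.0 + 4.0 * (1.0 - progress)   # large alpha early -> mass near 1
    b = 1.0 + 4.0 * progress           # large beta late   -> mass near 0
    u = np.random.beta(a, b)
    return int(round(fps_min + u * (fps_max - fps_min)))
```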
DATASETS AND SETTINGS
- As the image training data, we use a 2.3B-image subset of the dataset of Schuhmann et al.
- NSFW images, images whose text contains toxic words, and images with a watermark probability of 0.5 or higher are excluded.
- WebVid-10M is used as the video training set; the evaluation on MSR-VTT is zero-shot (the model is never trained on it).
- Zero-shot evaluation is performed on UCF-101 and MSR-VTT. For UCF-101, Frechet Video Distance (FVD) and Inception Score (IS) are computed over 10K samples; for MSR-VTT, the average Frechet Inception Distance (FID) and CLIPSIM are reported. For human evaluation, a set of 300 prompts was collected via AMT, selected from five categories (animals, fantasy, humans, nature, and food); the DrawBench prompts from Imagen are also used.
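For reference, CLIPSIM can be sketched as the average cosine similarity between the CLIP text embedding of the prompt and the CLIP image embeddings of the generated frames; the encoders are passed in here as generic callables rather than a specific CLIP API:

```python
import torch
import torch.nn.functional as F

def clipsim(frames: torch.Tensor, prompt: str, encode_image, encode_text) -> float:
    """frames: (F, 3, H, W), preprocessed for the CLIP image encoder."""
    with torch.no_grad():
        img_emb = F.normalize(encode_image(frames), dim=-1)    # (F, D) per-frame embeddings
        txt_emb = F.normalize(encode_text([prompt]), dim=-1)   # (1, D) text embedding
        return (img_emb @ txt_emb.T).mean().item()             # average cosine similarity
```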
QUANTITATIVE RESULTS
- Automated evaluation on MSR-VTT and UCF-101 shows that Make-A-Video outperforms GODIVA and NÜWA, and also outperforms CogVideo.
- In human evaluation, Make-A-Video achieved higher video quality and text-video faithfulness than CogVideo on both DrawBench and the test set. Compared with FILM (Reda et al., 2022), its motion was judged more realistic 62% of the time on the test set and 54% on DrawBench.
QUALITATIVE RESULTS
- Figure 1 shows an example of Make-A-Video generation.
- T2V generation is compared against CogVideo (Hong et al., 2022) and VDM (Ho et al., 2022), and video interpolation is compared against FILM (Reda et al., 2022).
- The model can also be used for other tasks such as image animation and video variation.
- Figure 4(c) shows a comparison with FILM (Reda et al., 2022) on the task of interpolating between two images.
- Our model produces more semantically meaningful interpolations than FILM, which tends to transition smoothly between frames without a semantic understanding of the world. Figure 4(d) shows an example of video variation.
- Conditioned on the average CLIP embedding of all frames of a source video, the model generates semantically similar videos. More generation examples and applications are available at make-a-video.github.io.
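A hedged sketch of that video-variation idea, with encode_frame and decoder_Dt as hypothetical stand-ins for the CLIP image encoder and the spatiotemporal decoder:

```python
import torch

def video_variation(source_frames: torch.Tensor, encode_frame, decoder_Dt) -> torch.Tensor:
    """source_frames: (F, 3, H, W). Returns a new, semantically similar video."""
    with torch.no_grad():
        frame_embs = encode_frame(source_frames)      # (F, D) CLIP image embeddings
        y_e = frame_embs.mean(dim=0, keepdim=True)    # average embedding over all frames
    return decoder_Dt(y_e)                            # generate a video from that embedding
```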
DISCUSSION
- One of the greatest strengths of human intelligence is its ability to learn from the world around it.
- Generative systems can be more creative and useful if they mimic the way humans learn.
- To break away from reliance on labeled data, unsupervised learning from large amounts of video is helpful.
- Technical limitations remain to be addressed; generating longer videos and capturing relationships that can only be inferred from video rather than text are also challenges for future work.
- The training datasets are public, which adds transparency to the model, and NSFW content and harmful words were filtered out.