Contents
1. Introduction
Problem Definition and Goals
We aim to optimize large-scale generative model training through a unified representation of visual data. This report covers two primary areas.
Prior Work and Comparison
Previous studies have used a variety of approaches for generative modeling of video data.
These methods typically target a narrow category of visual data, focusing on short videos or videos of a fixed size.
2. Turning Visual Data into Patches
The success of large language models stems from training on internet-scale data to acquire general capabilities, enabled by the use of tokens that unify diverse modalities of text. This work explores how generative models of visual data can inherit these benefits. Where language models use text tokens, Sora uses visual patches. Patches have previously been shown to be an effective representation for models of visual data.
3. Implementation of Patches
3.1 Converting Videos into Patches
Videos are first compressed into a lower-dimensional latent space, and the resulting representation is then decomposed into spacetime patches. This allows high-resolution data to be processed more efficiently.
3.2 Video Compression Network
A network is trained to reduce the dimensionality of video. It takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on this compressed latent space, and a separate decoder model maps generated latents back to pixel space.
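Sora's compression network is not publicly described; below is a minimal sketch, assuming a 3D-convolutional autoencoder that downsamples in both time and space. The class name `VideoAutoencoder`, the channel counts, and the strides are all illustrative assumptions, not the actual architecture.

```python
# Minimal sketch of a video compression network (hypothetical; Sora's actual
# encoder/decoder architecture is not public). Downsamples time by 2x and
# space by 4x, then maps latents back to pixel space with a mirrored decoder.
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, in_channels=3, latent_channels=8):
        super().__init__()
        # Encoder: raw video -> temporally and spatially compressed latent.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, padding=1),
        )
        # Decoder: generated latent -> pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 128, kernel_size=3, stride=(2, 2, 2),
                               padding=1, output_padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=3, stride=(1, 2, 2),
                               padding=1, output_padding=(0, 1, 1)),
            nn.SiLU(),
            nn.Conv3d(64, in_channels, kernel_size=3, padding=1),
        )

    def forward(self, video):  # video: (batch, channels, time, height, width)
        latent = self.encoder(video)
        recon = self.decoder(latent)
        return latent, recon

x = torch.randn(1, 3, 16, 256, 256)   # 16 RGB frames at 256x256
latent, recon = VideoAutoencoder()(x)
print(latent.shape)                   # (1, 8, 8, 64, 64): compressed latent
print(recon.shape)                    # (1, 3, 16, 256, 256): back to pixels
```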
3.3 Spacetime Latent Patches
A sequence of spacetime patches is extracted from the compressed input video and used as transformer tokens. This scheme also applies to images, which are treated as single-frame videos. The patch-based representation enables training on videos and images of variable resolutions, durations, and aspect ratios.
\[\text{Given a compressed video } V_{\text{compressed}}, \text{ we extract spacetime patches } \{P_i\}_{i=1}^n.\]
Each patch $P_i$ represents a sub-volume of $V_{\text{compressed}}$ and is used as a transformer token. The same approach applies to images, which are treated as single-frame videos.
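To make the patch extraction concrete, here is a minimal sketch in PyTorch. The function `extract_spacetime_patches` and the patch sizes `pt`, `ph`, `pw` are assumptions for illustration; Sora's actual patch dimensions are not public.

```python
# Minimal sketch of extracting spacetime latent patches as transformer tokens.
import torch

def extract_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Split a latent video (C, T, H, W) into a sequence of flattened
    spacetime patches of shape (num_patches, C * pt * ph * pw)."""
    c, t, h, w = latent.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0, "dims must divide evenly"
    patches = (
        latent.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
              .permute(1, 3, 5, 0, 2, 4, 6)     # (T', H', W', C, pt, ph, pw)
              .reshape(-1, c * pt * ph * pw)    # one flattened token per patch
    )
    return patches

video_latent = torch.randn(8, 8, 64, 64)        # (C, T, H, W) from the encoder
tokens = extract_spacetime_patches(video_latent)
print(tokens.shape)                             # (4 * 16 * 16, 8*2*4*4) = (1024, 256)

# An image is just a single-frame video: use a temporal patch size of 1.
image_latent = torch.randn(8, 1, 64, 64)
image_tokens = extract_spacetime_patches(image_latent, pt=1)
print(image_tokens.shape)                       # (256, 128)
```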
4. Scaling Transformers for Video Generation
4.1 Diffusion Model
Sora is a diffusion model trained to predict the original "clean" patches given input noisy patches and conditioning information such as text prompts. Importantly, Sora is a diffusion transformer.
Let \(x_0\) be the original patch and \(x_t\) be the noisy patch at timestep \(t\). The diffusion process is defined as:
\[q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right)\]
Transformers have shown strong scalability across a variety of domains, and this holds for visual data generation as well. This work finds that diffusion transformers also scale effectively as video models.
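Applying the forward process above repeatedly gives the usual DDPM closed form \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon\) with \(\bar{\alpha}_t = \prod_{s \le t} \alpha_s\). The sketch below uses that closed form to noise clean patch tokens and trains a denoiser to recover them. The noise schedule and the placeholder MLP denoiser are illustrative assumptions; in Sora the denoiser is a transformer over spacetime patches, and conditioning on the timestep and text is omitted here for brevity.

```python
# Minimal sketch of the forward diffusion and the clean-patch-prediction
# objective on patch tokens, following the q(x_t | x_{t-1}) defined above
# (DDPM-style [22]). The denoiser here is a stand-in, not Sora's model.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t = prod_{s<=t} alpha_s

def add_noise(x0, t):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1)            # broadcast over (batch, tokens, dim)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

# Placeholder denoiser: in Sora this is a transformer over spacetime patches.
denoiser = nn.Sequential(nn.Linear(256, 512), nn.SiLU(), nn.Linear(512, 256))

x0 = torch.randn(4, 1024, 256)                   # batch of clean patch tokens
t = torch.randint(0, T, (4,))
xt = add_noise(x0, t)
loss = nn.functional.mse_loss(denoiser(xt), x0)  # predict the "clean" patches
loss.backward()
```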
5. Experimental Setup and Results
5.1 Base Compute
Models are trained with 4x and 32x compute on videos of varying durations, resolutions, and aspect ratios. Whereas prior approaches resize, crop, or trim videos to a standard size, this work finds that training on data at its native size provides several benefits.
5.2 Sampling Flexibility
Sora can generate widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content at the native aspect ratios of different devices, and allows rapid prototyping at lower resolutions before generating at full resolution.
5.3 Improved Framing and Composition
Training at native aspect ratios is found empirically to improve composition and framing. A model trained on square crops can generate videos in which the subject is only partially in view, whereas Sora produces improved framing.
Let \(V_{\text{native}}\) be a video at its native aspect ratio, and \(V_{\text{square}}\) be the same video cropped to a square.
\[\text{Framing Quality}(V_{\text{native}}) > \text{Framing Quality}(V_{\text{square}})\]
5.4 Language Understanding
Training a text-to-video generation system requires a large number of videos with corresponding text captions. The re-captioning technique introduced in DALL·E 3 is applied to video to generate text captions for every video in the training set, which improves text fidelity as well as overall video quality.
6. Conclusion
Sora's approach delivers substantial advances in sampling flexibility, improved framing and composition, and language understanding for text-to-video generation. This comprehensive approach produces high-quality videos that accurately follow user prompts, demonstrating the model's adaptability to diverse content types and formats.
\[\text{Overall Quality}(\text{Sora}) > \text{Overall Quality}(\text{Previous Methods})\]
Sora represents a significant step forward in generative modeling of visual data, offering a flexible and powerful tool for producing diverse, high-quality visual content.
Overview
This report delves into two primary areas:
Background and Comparison
Prior studies on generative modeling of video data have employed various approaches, such as recurrent networks, [1], [2], [3] generative adversarial networks, [4], [5], [6], [7] autoregressive transformers, [8], [9] and diffusion models. [10], [11], [12]
These methods typically target a specific category of visual data, focusing on shorter videos or videos of a fixed size.
Turning visual data into patches
We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. [13], [14] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[15], [16], [17],[18]
We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.
Figure: Patches
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, [19] and subsequently decomposing the representation into spacetime patches.
Video compression network
We train a network that reduces the dimensionality of visual data. [20] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
Spacetime latent patches
Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
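A minimal sketch of that inference-time control follows, assuming illustrative patch sizes and token dimensions: the target video size simply determines how many Gaussian-noise tokens are laid out in the spacetime grid that the model then denoises.

```python
# Minimal sketch of controlling generated video size at sampling time by
# choosing how many randomly-initialized patch tokens to arrange in a
# spacetime grid (grid and patch sizes are illustrative assumptions).
import torch

def init_noise_tokens(frames, height, width, pt=2, ph=4, pw=4, token_dim=256):
    """Lay out Gaussian-noise tokens for a target latent size; the denoiser
    then iteratively refines these tokens into a video of that size."""
    grid = (frames // pt, height // ph, width // pw)
    num_tokens = grid[0] * grid[1] * grid[2]
    return torch.randn(num_tokens, token_dim), grid

# Widescreen vs. vertical targets just change the grid shape and token count.
wide_tokens, wide_grid = init_noise_tokens(frames=16, height=60, width=104)
tall_tokens, tall_grid = init_noise_tokens(frames=16, height=104, width=60)
print(wide_grid, wide_tokens.shape)   # (8, 15, 26), torch.Size([3120, 256])
print(tall_grid, tall_tokens.shape)   # (8, 26, 15), torch.Size([3120, 256])
```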
Scaling transformers for video generation
Sora is a diffusion model [21], [22], [23], [24], [25]; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer. [26] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, [13], [14] computer vision, [15], [16], [17], [18] and image generation. [27], [28], [29]
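Below is a minimal sketch of a transformer denoiser over spacetime patch tokens, loosely in the spirit of a diffusion transformer. [26] The additive timestep/text conditioning, the assumed 768-dimensional text embedding, and the omission of positional embeddings are simplifications for illustration; the actual architecture and conditioning mechanism are not public.

```python
# Minimal sketch of a transformer denoiser over spacetime patch tokens.
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    def __init__(self, token_dim=256, model_dim=512, depth=4, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(token_dim, model_dim)
        self.t_embed = nn.Embedding(1000, model_dim)   # diffusion timestep embedding
        self.text_proj = nn.Linear(768, model_dim)     # assumed text-encoder dim
        layer = nn.TransformerEncoderLayer(model_dim, heads,
                                           dim_feedforward=4 * model_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj_out = nn.Linear(model_dim, token_dim)

    def forward(self, noisy_tokens, t, text_emb):
        # noisy_tokens: (batch, num_patches, token_dim); text_emb: (batch, 768)
        h = self.proj_in(noisy_tokens)
        cond = self.t_embed(t) + self.text_proj(text_emb)  # (batch, model_dim)
        h = h + cond.unsqueeze(1)                          # broadcast over tokens
        h = self.blocks(h)
        return self.proj_out(h)                            # predicted clean patches

model = PatchDenoiser()
pred = model(torch.randn(2, 1024, 256),
             torch.randint(0, 1000, (2,)),
             torch.randn(2, 768))
print(pred.shape)                                          # (2, 1024, 256)
```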
Figure: Diffusion
In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.
Figure: video samples at base compute, 4x compute, and 32x compute
Variable durations, resolutions, aspect ratios
Past approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.
Improved framing and composition
We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
Language understanding
Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 [30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos. Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
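As an illustration of the prompt-upsampling step only (not OpenAI's actual pipeline), a short user prompt can be expanded into a detailed caption with a chat model before it is passed to the video model; the system-prompt wording below is an assumption.

```python
# Illustrative sketch of prompt upsampling: expand a short user prompt into a
# long, descriptive caption in the style the video model was trained on.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def upsample_prompt(user_prompt: str) -> str:
    """Rewrite a short prompt as a detailed video caption (assumed wording)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as a long, highly detailed "
                        "video caption describing subjects, motion, setting, "
                        "camera framing, and lighting."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

detailed_caption = upsample_prompt("a corgi surfing at sunset")
# detailed_caption is then used as conditioning for the text-to-video model.
```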
Approach
Sora stands out as a generalist model for visual data, offering several key advantages:
Turning Visual Data into Patches
Inspired by the success of large language models (LLMs) that train on diverse internet-scale data to acquire general capabilities across various domains including code, mathematics, and multiple natural languages, this work explores the adaptation of a similar approach to generative models of visual data. The core idea revolves around the concept of “patches,” akin to the text tokens in LLMs, but designed for the visual domain.
Implementation of Patches
This approach underscores the potential of borrowing conceptual frameworks from the realm of language processing and applying them to the visual domain, paving the way for more advanced and capable generative models.
Video Compression Network
Spacetime Latent Patches
Scaling Transformers for Video Generation
Key Features and Innovations
Conclusion
[1] Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. “Unsupervised learning of video representations using lstms.” International conference on machine learning. PMLR, 2015.
[2] Chiappa, Silvia, et al. “Recurrent environment simulators.” arXiv preprint arXiv:1704.02254 (2017).
[3] Ha, David, and Jürgen Schmidhuber. “World models.” arXiv preprint arXiv:1803.10122 (2018).
[4] Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Generating videos with scene dynamics.” Advances in neural information processing systems 29 (2016).
[5] Tulyakov, Sergey, et al. “Mocogan: Decomposing motion and content for video generation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[6] Clark, Aidan, Jeff Donahue, and Karen Simonyan. “Adversarial video generation on complex datasets.” arXiv preprint arXiv:1907.06571 (2019).
[7] Brooks, Tim, et al. “Generating long videos of dynamic scenes.” Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
[8] Yan, Wilson, et al. “Videogpt: Video generation using vq-vae and transformers.” arXiv preprint arXiv:2104.10157 (2021).
[9] Wu, Chenfei, et al. “Nüwa: Visual synthesis pre-training for neural visual world creation.” European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
[10] Ho, Jonathan, et al. “Imagen video: High definition video generation with diffusion models.” arXiv preprint arXiv:2210.02303 (2022).
[11] Blattmann, Andreas, et al. “Align your latents: High-resolution video synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[12] Gupta, Agrim, et al. “Photorealistic video generation with diffusion models.” arXiv preprint arXiv:2312.06662 (2023).
[13] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
[14] Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.
[15] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
[16] Arnab, Anurag, et al. “Vivit: A video vision transformer.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[17] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[18] Dehghani, Mostafa, et al. “Patch n’Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution.” arXiv preprint arXiv:2307.06304 (2023).
[19] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[20] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
[21] Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International conference on machine learning. PMLR, 2015.
[22] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851.
[23] Nichol, Alexander Quinn, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” International Conference on Machine Learning. PMLR, 2021.
[24] Dhariwal, Prafulla, and Alexander Quinn Nichol. “Diffusion Models Beat GANs on Image Synthesis.” Advances in Neural Information Processing Systems. 2021.
[25] Karras, Tero, et al. “Elucidating the design space of diffusion-based generative models.” Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
[26] Peebles, William, and Saining Xie. “Scalable diffusion models with transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[27] Chen, Mark, et al. “Generative pretraining from pixels.” International conference on machine learning. PMLR, 2020.
[28] Ramesh, Aditya, et al. “Zero-shot text-to-image generation.” International Conference on Machine Learning. PMLR, 2021.
[29] Yu, Jiahui, et al. “Scaling autoregressive models for content-rich text-to-image generation.” arXiv preprint arXiv:2206.10789 (2022).
[30] Betker, James, et al. “Improving image generation with better captions.” (2023). https://cdn.openai.com/papers/dall-e-3.pdf
[31] Ramesh, Aditya, et al. “Hierarchical text-conditional image generation with clip latents.” arXiv preprint arXiv:2204.06125 (2022).
[32] Meng, Chenlin, et al. “Sdedit: Guided image synthesis and editing with stochastic differential equations.” arXiv preprint arXiv:2108.01073 (2021).
After reading the papers above, the architecture here should start to make sense. The technical report is a 10,000 foot view, and my hope is that each paper will zoom into different aspects and paint the full picture. There is a nice literature review called “Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models” that gives a high level diagram of a reverse engineered architecture.
The team at OpenAI states that Sora is a “Diffusion Transformer” which combines many of the concepts listed in the papers above, but applied to latent spacetime patches generated from video.
This is a combination of the style of patches used in the Vision Transformer (ViT) paper, with latent spaces similar to the Latent Diffusion paper, combined in the style of the Diffusion Transformer. They not only have patches across the width and height of the image but also extend them along the time dimension of the video.
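A small back-of-the-envelope sketch of what that time extension buys: with spacetime patches, each token spans several frames, so a video produces far fewer tokens than patchifying every frame independently. The patch sizes here are illustrative, not Sora's.

```python
# Token counts: per-frame 2D patches vs. 3D spacetime patches.
def num_tokens_2d(frames, h, w, ph=16, pw=16):
    return frames * (h // ph) * (w // pw)           # one grid of patches per frame

def num_tokens_spacetime(frames, h, w, pt=4, ph=16, pw=16):
    return (frames // pt) * (h // ph) * (w // pw)   # each patch also spans time

print(num_tokens_2d(64, 512, 512))                  # 64 * 32 * 32 = 65536 tokens
print(num_tokens_spacetime(64, 512, 512))           # 16 * 32 * 32 = 16384 tokens
```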
It’s hard to say exactly how they collected the training data for all of this, but it seems like a combination of the techniques in the DALL·E 3 paper as well as using GPT-4 to elaborate on textual descriptions of images, which they then turn into videos. Training data is likely the main secret sauce here, and hence gets the least detail in the technical report.
Use Cases
There are many interesting use cases and applications for video generation technologies like Sora. Whether it be movies, education, gaming, healthcare or robotics, there is no doubt generating realistic videos from natural language prompts is going to shake up multiple industries. The note at the bottom of this diagram rings true for us at Oxen.ai. If you are not familiar with Oxen.ai, we are building open source tools to help you collaborate on and evaluate the data that comes in and out of machine learning models. We believe that many people need visibility into this data, and that it should be a collaborative effort. AI is touching many different fields and industries, and the more eyes on the data that trains and evaluates these models, the better.
Check us out here: Oxen.ai