OpenAI Announces Sora, a Text-to-Video Generator


We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
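The abstract's "spacetime patches" can be illustrated with a minimal sketch: a video latent tensor is cut into non-overlapping temporal-by-spatial blocks, each flattened into one transformer token. The shapes, patch sizes, and the `patchify` helper below are all illustrative assumptions, not Sora's actual configuration.

```python
import numpy as np

def patchify(latent, t=2, p=4):
    """Split a (frames, height, width, channels) latent into spacetime patches.

    Each t x p x p block of latent values becomes one flattened token,
    so a transformer can attend across both space and time uniformly.
    Hypothetical sketch; patch sizes here are arbitrary.
    """
    T, H, W, C = latent.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = latent.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dims together
    return x.reshape(-1, t * p * p * C)    # one row per spacetime patch

# An 8-frame, 16x16, 4-channel latent yields 4*4*4 = 64 tokens of size 128.
tokens = patchify(np.zeros((8, 16, 16, 4)))
print(tokens.shape)  # (64, 128)
```

One appeal of this representation, which the report emphasizes, is that it handles variable durations, resolutions, and aspect ratios naturally: any input divisible into patches simply produces a different number of tokens.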

I am linking to the research page for Sora; the overview page has other examples but is less descriptive. Unfortunately, the research page is difficult to read because, for me in Safari, its many lazy-loading embedded videos cause the scroll position to jump around unexpectedly.

Sora's output is far more impressive than this janky webpage suggests. It is hard not to be in awe of how far these systems have progressed and what they can now do — from whole-cloth generation to more nuanced tasks like extending a video's runtime or changing its setting.