MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale

Haozhe Liu1,2, Shikun Liu1, Zijian Zhou1, Mengmeng Xu1, Yanping Xie1, Xiao Han1, Juan C. Pérez1, Ding Liu1, Kumara Kahatapitiya1, Menglin Jia1, Jui-Chieh Wu1, Sen He1, Tao Xiang1, Jürgen Schmidhuber2, and Juan-Manuel Pérez-Rúa1

1Meta AI    2KAUST

We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking the middle frames), image-to-video generation (e.g., masking every frame from the second onward), and video expansion (e.g., masking half the frames). Its efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state of the art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive, advanced image-to-video models.

Technical Report

Introduction

We propose a new paradigm for video generation that combines the flexibility of masked auto-regression (MAR) in a continuous space with the robust generative capabilities of diffusion models (DMs). Specifically, we present a scalable training recipe and an efficient neural architecture design for video generation. Our model decomposes video generation into two sub-tasks, temporal and spatial modelling, handled by distinct networks with an asymmetric design based on the following two principles:

1. Temporal planning is where most of the capacity should go: a heavy planning model operates on low-resolution inputs, where global spatio-temporal attention remains affordable at scale.
2. Spatial detail can be handled cheaply: a much lighter generation model operates at high resolution, filling in fine-grained, frame-level detail under the planner's guidance.

Following these principles, our model integrates MAR-based planning signals into a lightweight, DiT-based diffusion model, hence the name MarDini. Our empirical study highlights MarDini's key characteristics, which the results below demonstrate.

MarDini Training Pipeline Overview. A latent representation is computed for the unmasked frames, which serves as the conditioning signal for the generative process. On the one hand, a planning model auto-regressively encodes global conditioning signals from a low-resolution version of the unmasked latent inputs. On the other hand, these planning signals are fed to the diffusion-based generation model through cross-attention layers. A high-resolution version of the input conditions is also ingested by the diffusion model, enabling generation with a coherent temporal structure and a direct mechanism for attending to fine-grained details of the unmasked frames. MarDini is trained end-to-end via a masked, frame-level diffusion loss.
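
To make this pipeline concrete, the sketch below re-expresses the description above as PyTorch-style pseudocode. It is a minimal illustration rather than the released implementation: vae, planning_model, generation_model, and add_noise are hypothetical interfaces standing in for the actual modules and noise schedule.

import torch
import torch.nn.functional as F

def mardini_training_step(frames, vae, planning_model, generation_model,
                          add_noise, mask_ratio=0.7):
    """One illustrative training step. frames: (B, T, C, H, W) video."""
    B, T = frames.shape[:2]

    # 1. Sample a frame-level mask: True marks frames the model must generate.
    mask = torch.rand(B, T, device=frames.device) < mask_ratio

    # 2. Encode all frames into latents; keep a down-sampled copy for planning.
    lat_hr = vae.encode(frames.flatten(0, 1))                   # (B*T, c, h, w)
    lat_lr = F.interpolate(lat_hr, scale_factor=0.5, mode="bilinear")
    lat_hr = lat_hr.unflatten(0, (B, T))
    lat_lr = lat_lr.unflatten(0, (B, T))

    # 3. Planning model: MAR-style encoding of the unmasked low-res latents,
    #    yielding per-frame conditioning tokens for the diffusion model.
    plan_tokens = planning_model(lat_lr, frame_mask=mask)        # (B, T, L, d)

    # 4. Diffuse only the masked frames; unmasked frames pass through clean
    #    as high-resolution conditioning.
    t = torch.randint(0, 1000, (B,), device=frames.device)
    noise = torch.randn_like(lat_hr)
    noisy = add_noise(lat_hr, noise, t)            # any standard DDPM schedule
    model_in = torch.where(mask[..., None, None, None], noisy, lat_hr)

    # 5. The lightweight generation model predicts the noise, attending to
    #    the planning tokens via cross-attention.
    pred = generation_model(model_in, t, context=plan_tokens)

    # 6. Masked frame-level diffusion loss: only generated frames contribute.
    per_frame = ((pred - noise) ** 2).mean(dim=(2, 3, 4))        # (B, T)
    return (per_frame * mask).sum() / mask.sum().clamp(min=1)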

MarDini Generations

Here, we showcase MarDini's video generation capabilities through diverse masking strategies.
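
As a concrete reference for the masking strategies used in the sections below, the following sketch builds the frame-level masks they describe, over a 17-frame window where True marks a frame to be generated. The helper and its name are purely illustrative.

import torch

def make_frame_mask(task: str, num_frames: int = 17) -> torch.Tensor:
    """Illustrative frame-level masks; True = frame is generated by MarDini."""
    mask = torch.ones(num_frames, dtype=torch.bool)
    if task == "image_to_video":
        # One reference frame placed in the middle; the rest are generated.
        mask[num_frames // 2] = False
    elif task == "expansion":
        # The first 5 frames come from an existing clip; 12 new frames follow.
        mask[:5] = False
    elif task == "interpolation":
        # First and last frames are given; intermediate frames are generated.
        mask[0] = False
        mask[-1] = False
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

# e.g. interpolation: 15 of the 17 frames are generated.
print(make_frame_mask("interpolation").sum().item())  # -> 15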

Image-To-Video Results

The primary application of MarDini is image-to-video generation. We demonstrate this capability by using one reference frame, placed in the middle position, as the conditioning input and generating 16 additional frames. Below, we present a wide range of generated videos, each consisting of 17 frames rendered at 8 FPS, producing a smooth 2-second video.

Video Expansion Results

MarDini is also capable of video expansion by conditioning on existing videos of any duration. We demonstrate this by generating 2-second expansions from 5-frame reference videos, adding 12 new frames to each sequence, shown below.

The expanded frames (right video) are generated based on the five initial frames (left images). The sequence consists of 17 frames at 8 FPS, forming a 2-second video. The conditional frames are from OpenVidHD-0.4M, which was excluded from the training data.

Video Interpolation Results

MarDini achieves video interpolation by generating intermediate frames using the first and last frames as conditioning signals. When these boundary frames are identical, MarDini can create seamless looping videos.

The intermediate frames (right video) are generated based on the first and last frames (left images). The sequence consists of 17 frames at 8 FPS, forming a 2-second video. The conditional frames are from OpenVidHD-0.4M, which was excluded from the training data.

Auto-Regressive Generation for Slow-Motion Videos

By using MAR for high-level planning, MarDini supports auto-regressive inference, generating more frames than the number defined during training. We demonstrate this with hierarchical auto-regressive generation: starting with a video, we segment it into multiple clips, expand each clip, and treat the expanded clip as a new video for recursive interpolation. For example, starting with 4 images, MarDini uses a 17-frame window to expand them into a 128-frame slow-motion video (64× expansion), as sketched below. This shows that our model is not constrained by the training window size, underscoring its potential for long-range video generation.
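
A rough sketch of this recursive procedure is given below, assuming a hypothetical interpolate_pair(a, b, n) callable that wraps MarDini's interpolation mode and returns n generated in-between frames conditioned on frames a and b. The exact segmentation and window schedule behind the 128-frame example is not spelled out here; the sketch only illustrates the recursion with one generated frame per gap, so each pass maps L frames to 2L - 1.

from typing import Callable, List, TypeVar

F = TypeVar("F")  # a single video frame, e.g. an image tensor

def hierarchical_expand(frames: List[F],
                        interpolate_pair: Callable[[F, F, int], List[F]],
                        target_len: int = 128) -> List[F]:
    """Recursively densify a sparse frame list into a slow-motion sequence."""
    while len(frames) < target_len:
        dense: List[F] = [frames[0]]
        for a, b in zip(frames[:-1], frames[1:]):
            # Fill each gap with frames generated by the interpolation model,
            # then keep the right endpoint; the denser result is treated as a
            # new video for the next pass.
            dense.extend(interpolate_pair(a, b, 1))
            dense.append(b)
        frames = dense
    return frames[:target_len]

# Starting from 4 key frames, successive passes yield 7, 13, 25, 49, 97, and
# 193 frames; the list is then truncated to the requested 128-frame video.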

The results are auto-regressively generated from four initial frames (left images) into a 128-frame video (right video). The conditional frames are from OpenVidHD-0.4M or Pexels, both of which were excluded from the training data.

Zero-Shot 3D View Synthesis

Although trained solely on video data, MarDini shows preliminary spatial understanding, suggesting potential for 3D applications. In the following example, two views of a fixed object serve as the first and last reference frames, while the intermediate frames are generated, as in our video interpolation task. The model produces convincing, 3D-consistent views, showcasing its promise for 3D generation. Notably, no camera motion control signals were used. We plan to explore MarDini's performance on 3D data with better control in future work.

Starting with two views of an object, MarDini generates the intermediate frames, creating novel views. Notably, MarDini is trained without any 3D data, yet it still manages to capture spatial information through video. The conditional frames are from the GSO-30 dataset, which was excluded from the training data.

Conclusion

We have introduced MarDini, a new family of generative video models based on masked auto-regressive diffusion, in which a large planning model provides powerful conditioning to a much smaller diffusion model. Our design philosophy considers efficiency from the model's conception: the heaviest component is executed only once, on low-resolution inputs, while the lightweight generative module focuses on fine-grained, frame-level detail, reconciling high-level conditioning with image detail. Our model is unique in that it applies a masked auto-regressive loss directly at the frame level. As a result, a single MarDini model supports multiple generative capabilities, e.g., long-term video interpolation, video expansion, and image animation.

Citation

To cite the paper, please use the following BibTeX entry:

@article{liu2024mardini,
  title={MarDini: Masked Autoregressive Diffusion for Video Generation at Scale},
  author={Haozhe Liu and Shikun Liu and Zijian Zhou and Mengmeng Xu and Yanping Xie and Xiao Han and Juan C. Pérez and Ding Liu and Kumara Kahatapitiya and Menglin Jia and Jui-Chieh Wu and Sen He and Tao Xiang and Jürgen Schmidhuber and Juan-Manuel Pérez-Rúa},
  journal={arXiv preprint arXiv:2410.20280},
  year={2024}
}