A Good Image Generator Is What You Need for High-Resolution Video Synthesis

¹Rutgers University, ²Snap Inc., ³University of Delaware

ICLR 2021, Spotlight

In- and Cross-Domain Video Generation

Abstract

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic.

We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques.
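The idea of treating a video as a trajectory in the latent space of a frozen image generator can be illustrated with a small toy sketch. Everything below is a hypothetical stand-in and not the authors' implementation: `toy_image_generator` replaces the pre-trained image generator, `ToyMotionGenerator` is a tiny recurrence with random (untrained) weights, and the latent size and the 0.1 step scale are arbitrary choices made only so the example runs with the Python standard library.

```python
import math
import random

LATENT_DIM = 8  # hypothetical size; real image generators use much larger latents


def toy_image_generator(z):
    """Stand-in for a pre-trained, frozen image generator (hypothetical).

    Maps a latent code to a tiny "image" (a list of floats).
    """
    return [math.tanh(v) for v in z]


class ToyMotionGenerator:
    """Toy recurrent model that emits one latent residual per time step."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Fixed random recurrent weights; a real model would learn these.
        self.w = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]

    def step(self, hidden):
        return [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, hidden)))
                for row in self.w]

    def trajectory(self, content_code, motion_code, num_frames):
        """Return num_frames latent codes, the first being content_code."""
        codes = [list(content_code)]
        hidden = list(motion_code)
        current = list(content_code)
        for _ in range(num_frames - 1):
            hidden = self.step(hidden)
            # Small residual step keeps consecutive frames close in latent space.
            current = [c + 0.1 * h for c, h in zip(current, hidden)]
            codes.append(current)
        return codes


def synthesize_video(content_code, motion_code, num_frames=16):
    """Render each latent on the trajectory with the frozen image generator."""
    motion = ToyMotionGenerator(LATENT_DIM)
    latents = motion.trajectory(content_code, motion_code, num_frames)
    return [toy_image_generator(z) for z in latents]


rng = random.Random(42)
content = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
motion = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
video = synthesize_video(content, motion, num_frames=16)
print(len(video), len(video[0]))  # 16 8
```

Only the motion model would need training in such a setup, since the image generator is never updated; this is one way to read the efficiency claim above, under the toy assumptions stated.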

32-Frames

FaceForensics video generation. We generate long sequences by unrolling the motion generator, trained on 16-frame clips, for 32 steps.

64-Frames

FaceForensics video generation. We generate long sequences by unrolling the motion generator, trained on 16-frame clips, for 64 steps.
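Unrolling beyond the training length is possible because a recurrent model produces each step from the previous state, so the number of steps is a free parameter at inference time. The sketch below uses a hypothetical `toy_step` recurrence (a small 2-D rotation) in place of the actual motion generator; it only illustrates that a longer rollout extends a shorter one rather than replacing it.

```python
import math


def rollout(step_fn, state, num_steps):
    """Apply step_fn repeatedly and collect every intermediate state."""
    states = []
    for _ in range(num_steps):
        state = step_fn(state)
        states.append(state)
    return states


def toy_step(state):
    """Hypothetical recurrence: rotate a 2-D state by a small fixed angle."""
    x, y = state
    angle = 0.1
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


start = (1.0, 0.0)
short_states = rollout(toy_step, start, 16)  # the length seen during training
long_states = rollout(toy_step, start, 64)   # unrolled for longer at inference

# The first 16 steps of the long rollout match the short rollout exactly.
print(long_states[:16] == short_states)  # True
```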

32-Frames

(AFHQ-Dog, VoxCeleb) cross-domain video generation (512x512). We interpolate the 16-frame video to obtain 32 frames.
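The caption does not say how the interpolation is performed. One plausible reading, shown here purely as an assumption, is to resample the 16 latent codes of the trajectory at 32 evenly spaced positions with linear interpolation and then render each resampled code. The helper name `resample_latents` and the toy 2-D latents are hypothetical.

```python
def resample_latents(latents, num_out):
    """Linearly resample a latent trajectory to num_out evenly spaced codes.

    The first and last codes of the input are preserved.
    """
    n = len(latents)
    if n < 2 or num_out < 2:
        raise ValueError("need at least two input codes and two output codes")
    resampled = []
    for k in range(num_out):
        pos = k * (n - 1) / (num_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo                       # blend weight toward the next code
        resampled.append([(1.0 - w) * a + w * b
                          for a, b in zip(latents[lo], latents[hi])])
    return resampled


# Toy 16-step trajectory with 2-D latents (hypothetical values).
latents_16 = [[float(t), 2.0 * t] for t in range(16)]
latents_32 = resample_latents(latents_16, 32)
print(len(latents_32))  # 32
```

Interpolating in latent space rather than pixel space means every in-between frame is still an output of the image generator; whether the page's results were produced this way is not stated.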

Motion Diversity

Each row shows diverse motions synthesized with the same content.

Content Diversity

Each row shows the same motion applied to different content codes.
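Both grids follow from keeping content and motion as separate inputs: fix one and vary the other. The sketch below is a toy illustration under stated assumptions, not the authors' model. `motion_residuals` is a hypothetical stand-in that turns a motion seed into a sequence of latent offsets, and each latent frame is simply the content code plus the offset for that time step.

```python
import random


def motion_residuals(motion_seed, num_frames, dim):
    """Hypothetical motion model: cumulative random offsets, starting at zero."""
    rng = random.Random(motion_seed)
    residual = [0.0] * dim
    sequence = [list(residual)]
    for _ in range(num_frames - 1):
        residual = [r + rng.gauss(0.0, 0.1) for r in residual]
        sequence.append(residual)
    return sequence


def latent_video(content_code, residuals):
    """Add each per-frame offset to the content code."""
    return [[c + r for c, r in zip(content_code, res)] for res in residuals]


DIM, FRAMES = 4, 16
rng = random.Random(0)
contents = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(3)]

# Content diversity: one shared motion applied to several content codes.
shared_motion = motion_residuals(motion_seed=7, num_frames=FRAMES, dim=DIM)
content_rows = [latent_video(c, shared_motion) for c in contents]

# Motion diversity: one content code driven by several motions.
motion_rows = [latent_video(contents[0], motion_residuals(seed, FRAMES, DIM))
               for seed in (1, 2, 3)]

print(len(content_rows), len(motion_rows))  # 3 3
```

In this toy setup every row of `motion_rows` starts from the same first frame, and every row of `content_rows` moves by the same offsets, which mirrors what the two grids are meant to show.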

BibTeX


@inproceedings{
    tian2021a,
    title={A Good Image Generator Is What You Need for High-Resolution Video Synthesis},
    author={Yu Tian and Jian Ren and Menglei Chai and Kyle Olszewski and Xi Peng and Dimitris N. Metaxas and Sergey Tulyakov},
    booktitle={International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=6puCSjH3hwA}
}