Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

March 2024

tl;dr: The first video generation pipeline that runs diffusion in latent space.

Overall impression

The two main advantages of Video LDM are computational efficiency and the ability to leverage pretrained image diffusion models. Video LDM takes pretrained image DMs and turns them into video generators by inserting temporal layers that enforce temporally coherent reconstruction. It is the first video diffusion model to operate in latent space rather than in pixel space.
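A minimal sketch of the "insert temporal layers" idea, in plain Python rather than a real deep learning framework. The names `spatial_layer`, `temporal_layer`, `video_block`, and the mixing factor `alpha` are hypothetical stand-ins, and the toy arithmetic inside each layer is an assumption chosen only to make the data flow visible: the image layer sees each frame independently, the temporal layer sees the frames as a sequence, and the two outputs are blended.

```python
from typing import List

Frame = List[float]   # one frame's latent features, flattened (toy stand-in)
Video = List[Frame]   # T frames


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a pretrained image layer: acts on one frame in isolation."""
    return [2.0 * x for x in frame]


def temporal_layer(video: Video) -> Video:
    """Stand-in for an inserted temporal layer: averages each feature
    with its immediate temporal neighbours (window of up to 3 frames)."""
    num_frames = len(video)
    out = []
    for t in range(num_frames):
        window = video[max(0, t - 1):min(num_frames, t + 2)]
        out.append([sum(f[i] for f in window) / len(window)
                    for i in range(len(video[t]))])
    return out


def video_block(video: Video, alpha: float) -> Video:
    """Blend per-frame output z with temporally mixed output z_t.
    alpha = 1.0 reproduces the image model's per-frame output exactly."""
    z = [spatial_layer(f) for f in video]   # frames treated as a batch of images
    z_t = temporal_layer(z)                 # frames treated as a sequence
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(fz, ft)]
            for fz, ft in zip(z, z_t)]


video = [[0.0], [1.0], [0.0]]               # a flickering single-feature "video"
image_only = video_block(video, alpha=1.0)  # no temporal mixing
smoothed = video_block(video, alpha=0.0)    # fully temporal output
```

With `alpha=1.0` the block behaves like the image model applied frame by frame, which is why a pretrained image DM can be reused as a starting point. Lowering `alpha` lets the temporal path smooth out frame-to-frame flicker.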

Diffusion models offer a robust and scalable training objective and are typically less parameter-intensive than their transformer-based counterparts.

Latent diffusion models work in a compressed, lower-dimensional latent space, which makes high-resolution video generation more tractable.
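To put a rough number on the savings, here is a back-of-the-envelope count of tensor elements in pixel space versus latent space. The clip size (16 frames at 512 x 1024), the spatial downsampling factor of 8, and the 4 latent channels are illustrative assumptions, not figures taken from these notes.

```python
def num_elements(frames: int, height: int, width: int, channels: int) -> int:
    """Number of scalar values in a video tensor of the given shape."""
    return frames * height * width * channels


def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4):
    """Spatial size after an autoencoder that shrinks H and W by `downsample`
    (assumed values; time is left uncompressed in this sketch)."""
    return height // downsample, width // downsample, latent_channels


frames, height, width = 16, 512, 1024              # assumed clip size
pixel_count = num_elements(frames, height, width, 3)   # RGB pixels
lat_h, lat_w, lat_c = latent_shape(height, width)
latent_count = num_elements(frames, lat_h, lat_w, lat_c)
ratio = pixel_count / latent_count

print(f"pixel space : {pixel_count:,} values")
print(f"latent space: {latent_count:,} values")
print(f"reduction   : {ratio:.0f}x")
```

Under these assumptions the diffusion model operates on 48x fewer values per clip, since each 8 x 8 x 3 pixel patch is replaced by 4 latent values.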

It is also cited by Sora as a comparison baseline. Video LDM is widely used in research projects due to its simplicity and compute efficiency.

The temporal consistency of the long driving videos is still NOT good: a given object does not keep a fixed appearance, similar to Drive into the Future. –> This is significantly improved by SVD: Stable Video Diffusion, which is a native video model.

Video generation displays multimodality, but not controllability (it is conditioned on simple weather conditions, crowdedness, and optionally bounding boxes). In this sense it is NOT a world model.

Key ideas

Technical details