March 2024
tl;dr: A neural simulator with a disentangled latent space, based on a VAE-GAN encoder and an RNN-style dynamics model.
DriveGAN uses a VAE to map pixels into a latent space. The VAE is trained with a GAN-style adversarial loss, hence the name DriveGAN.
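A minimal sketch of the VAE-GAN idea, i.e. a VAE whose reconstructions are also scored by a discriminator. The network sizes and loss weights below are illustrative assumptions, not DriveGAN's actual configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Map a 64x64 RGB frame to a Gaussian latent (illustrative sizes)."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.mu = nn.Linear(64 * 16 * 16, z_dim)
        self.logvar = nn.Linear(64 * 16 * 16, z_dim)

    def forward(self, x):
        h = self.conv(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Map a latent back to a 64x64 RGB frame."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.fc = nn.Linear(z_dim, 64 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 64, 16, 16))

class Discriminator(nn.Module):
    """Real/fake critic supplying the adversarial training signal."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 1),
        )

    def forward(self, x):
        return self.net(x)

def vae_gan_generator_loss(x, enc, dec, disc, kl_w=1.0, adv_w=0.1):
    """VAE reconstruction + KL, plus a GAN term pushing reconstructions
    to look real to the discriminator (weights are assumptions)."""
    mu, logvar = enc(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    x_rec = dec(z)
    rec = nn.functional.l1_loss(x_rec, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    logits = disc(x_rec)
    adv = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))  # try to fool the critic
    return rec + kl_w * kl + adv_w * adv
```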
The proposed architecture is a very general one for a world model, and is in fact very similar to more recent works such as GAIA-1 and Genie. The original World Models paper by Ha and Schmidhuber is also based on a VAE and an RNN. Over the years, the encoder/decoder has evolved from VAE + GAN to VQ-VAE + diffusion models, and the dynamics model has evolved from an RNN to Transformer-based, GPT-like next-token prediction. It is interesting to see how new techniques shine within this relatively old (although only two years old) framework. Two advances drive the progress: more powerful and scalable modules, and much, much more data.
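For reference, the shared skeleton of this family of world models fits in a few lines: encode frames to latents, roll the latents forward under actions, decode back to pixels. The module sizes below are assumptions for illustration; a pixel encoder/decoder pair like the VAE-GAN above sits on either side of this loop.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Generic encode -> latent dynamics -> decode skeleton shared by
    World Models, DriveGAN, GAIA-1, and Genie (sizes are illustrative)."""
    def __init__(self, z_dim=64, action_dim=3, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim + action_dim, hidden)
        self.to_z = nn.Linear(hidden, z_dim)  # predict the next latent

    def rollout(self, z0, actions):
        """Autoregressively roll latents forward under an action sequence.
        `actions` is a list of (B, action_dim) tensors."""
        h = z0.new_zeros(z0.size(0), self.rnn.hidden_size)
        z, latents = z0, []
        for a in actions:
            h = self.rnn(torch.cat([z, a], dim=-1), h)
            z = self.to_z(h)
            latents.append(z)
        return latents
```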
The main innovation of the paper is the disentanglement of the latent representation into a spatial-agnostic theme and spatial-aware content at the encoding stage, and a further disentanglement of the content into action-dependent and action-independent parts in the dynamics engine.
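A rough sketch of the two levels of disentanglement might look as follows; the half/half split, module choices, and dimensions are my assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Split the latent into a spatial-agnostic theme vector and a
    spatial-aware content grid (sizes are illustrative assumptions)."""
    def __init__(self, theme_dim=128, content_ch=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # global pooling discards spatial layout -> "theme" (e.g. weather, style)
        self.theme_head = nn.Linear(64, theme_dim)
        # 1x1 conv keeps the H x W grid -> "content" (what is where)
        self.content_head = nn.Conv2d(64, content_ch, kernel_size=1)

    def forward(self, x):
        h = self.backbone(x)
        theme = self.theme_head(h.mean(dim=(2, 3)))  # (B, theme_dim)
        content = self.content_head(h)               # (B, C, H, W)
        return theme, content

class DynamicsEngine(nn.Module):
    """One transition step: the ego action drives only the action-dependent
    half of the (flattened) content code; the action-independent half
    evolves on its own. A simplification of the paper's design."""
    def __init__(self, content_dim=512, action_dim=3):
        super().__init__()
        half = content_dim // 2
        self.dep_cell = nn.GRUCell(action_dim, half)
        self.indep_step = nn.Sequential(nn.Linear(half, half), nn.Tanh())

    def forward(self, content, action):
        dep, indep = content.chunk(2, dim=-1)
        dep_next = self.dep_cell(action, dep)  # conditioned on ego action
        indep_next = self.indep_step(indep)    # e.g. other agents' motion
        return torch.cat([dep_next, indep_next], dim=-1)
```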
The controllability of DriveGAN is achieved via careful architecture design. More modern approaches instead rely on scale and more unified interfaces (e.g., natural language).
The action of the agent is recovered by training another model, and the recovered actions are then used to reproduce the scene. This is similar to the inverse dynamics model idea in VPT. In a way, it verifies the controllability and geometric consistency of the simulation.
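One way to picture this action-recovery check, in the spirit of VPT: train a small network to infer the action between consecutive latents, then replay the recovered actions through the simulator and compare. All names and shapes here are hypothetical.

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Infer the action that took the scene from z_t to z_{t+1}."""
    def __init__(self, z_dim=64, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

def replay(step_fn, inverse_model, latents):
    """Reproduce a recorded latent sequence from recovered actions.
    `step_fn(z, a)` is an assumed one-step interface to the simulator."""
    z = latents[0]
    reproduced = [z]
    for z_next in latents[1:]:
        a_hat = inverse_model(z, z_next)  # recovered action
        z = step_fn(z, a_hat)             # re-simulate with it
        reproduced.append(z)
    return reproduced
```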
The paper has a very nice discussion regarding what a neural simulator should look like: first, the generations have to be realistic, and second, they need to be faithful to the action sequence used to produce them.