Genie: Generative Interactive Environments

March 2024

tl;dr: An 11B-parameter world model trained in an unsupervised manner on unlabeled internet videos.

Overall impression

The main innovation is the latent action model (LAM), trained via unsupervised learning. Essentially it performs discrete clustering of continuous actions. Action-conditioned data is hard to obtain; Genie unlocks the potential of using almost limitless online videos for training a world model.
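The "discrete clustering of continuous actions" can be pictured as a VQ-style bottleneck: a continuous latent inferred from consecutive frames is snapped to the nearest entry of a small learned codebook, which becomes the discrete action token. A minimal numpy sketch, where the function name, codebook size, and latent dimension are all illustrative rather than from the paper:

```python
import numpy as np

def quantize_latent_action(z_cont, codebook):
    """Vector-quantization step: map a continuous latent action to the
    nearest codebook entry (squared-L2 distance).

    z_cont:   (d,) continuous latent inferred from frames x_t and x_{t+1}
    codebook: (num_actions, d) learned discrete action embeddings
    Returns the index of the chosen discrete action and its embedding.
    """
    dists = np.sum((codebook - z_cont) ** 2, axis=1)  # distance to each code
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy example: 8 discrete actions living in a 4-dim latent space.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
z = rng.normal(size=4)
action_id, action_emb = quantize_latent_action(z, codebook)
```

At training time the gradient is passed through the quantizer (e.g. with a straight-through estimator), so the encoder and codebook are learned jointly; the snippet above only shows the inference-time lookup.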

The tech report is very concisely written, like other reports from DeepMind, with tons of useful training details in the appendix.

The model differs from GAIA-1 in that GAIA-1 still uses video data with action and text annotations. Architecture-wise, GAIA-1 uses a dedicated video decoder based on a diffusion model, while Genie reuses the decoder of the tokenizer. –> Maybe this explains the poor image quality.

The LAM is more general than the IDM in VPT, where some data are labeled first and the action predictor is then used to pseudo-label large sets of unlabeled data. –> Yet in a narrow domain such as autonomous driving, this may also be feasible.

The way the two networks are learned jointly is reminiscent of the self-supervised depth estimation paper SfM Learner.

A world model enables next-frame prediction conditioned on action inputs. Genie is a foundation world model and can be used to train generalist agents without direct environment experience at agent training time.
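The interactive loop implied here can be sketched generically: given an initial frame, the model is rolled out autoregressively, with each step conditioned on a chosen (discrete) action. The `world_model` below is a hypothetical stand-in, not Genie's actual dynamics model:

```python
import numpy as np

def rollout(world_model, frame, actions):
    """Autoregressive rollout: each step predicts the next frame from the
    current frame and an action input, then feeds the prediction back in."""
    frames = [frame]
    for a in actions:
        frame = world_model(frame, a)  # action-conditioned next-frame prediction
        frames.append(frame)
    return frames

# Toy stand-in "world model": shifts every pixel intensity by the action id.
toy_model = lambda f, a: f + a
trajectory = rollout(toy_model, np.zeros((2, 2)), actions=[1, 2, 3])
```

An agent (or a human player) supplies the `actions` sequence, which is what makes the generated environment interactive rather than a fixed video.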

Key ideas

Technical details