
Google’s GameNGen: Bringing Real-Time Game Simulation to Life with Neural Models

In recent years, generative models have made substantial strides in creating images and videos based on multi-modal inputs. At the forefront of this advancement, diffusion models have become the standard for media generation. However, while video generation has seen significant progress, simulating the interactive environments of video games—especially in real-time and at high quality—remains a complex challenge.

To tackle this, a research team from Google Research, Tel Aviv University, and Google DeepMind introduced a groundbreaking approach in their paper titled Diffusion Models Are Real-Time Game Engines. They present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with complex environments over extended sequences, maintaining high-quality output.

GameNGen is a generative diffusion model that learns to simulate interactive game environments. To collect training data, the researchers first train a reinforcement-learning agent to play the game, recording its actions and observations as it interacts with the environment. These recorded trajectories are compiled into a dataset, on which the generative model is then trained with a teacher-forcing objective: conditioned on a sequence of past frames and actions, it learns to predict the next frame.
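The data-collection phase can be sketched as a simple rollout loop. The sketch below is illustrative only: `ToyEnv`, `random_policy`, and `collect_trajectories` are hypothetical stand-ins, not the paper's actual agent or the DOOM environment.

```python
import numpy as np

def collect_trajectories(env, policy, num_steps, rng):
    """Roll out a policy and record (observation, action) pairs for later
    training of a generative next-frame model."""
    observations, actions = [], []
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs, rng)
        observations.append(obs)
        actions.append(action)
        obs = env.step(action)
    return np.stack(observations), np.array(actions)

class ToyEnv:
    """Toy stand-in environment that emits random frames."""
    def __init__(self, frame_shape=(8, 8, 3)):
        self.frame_shape = frame_shape
        self._rng = np.random.default_rng(42)
    def reset(self):
        return np.zeros(self.frame_shape, dtype=np.float32)
    def step(self, action):
        return self._rng.random(self.frame_shape).astype(np.float32)

def random_policy(obs, rng):
    # Assume 4 discrete actions for illustration.
    return int(rng.integers(0, 4))

rng = np.random.default_rng(0)
obs_seq, act_seq = collect_trajectories(ToyEnv(), random_policy, num_steps=16, rng=rng)
```

In the real system the policy is an RL agent trained to play DOOM, so the recorded trajectories cover behavior closer to human play than a random policy would.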

The training process begins with a pretrained checkpoint of Stable Diffusion 1.4, with all U-Net parameters unfrozen. The researchers use a batch size of 128, a constant learning rate of 2e-5, and the Adafactor optimizer without weight decay, with a gradient clipping value of 1.0. They also modify the diffusion loss to use v-prediction and apply noise augmentation to the conditioning frames, with a maximum noise level of 0.7 and the noise level discretized into 10 embedding buckets. The training data comprises all agent trajectories generated during reinforcement learning, as well as evaluation data collected during training.
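The noise-augmentation step described above can be sketched as follows: corrupt the conditioning (context) frames with Gaussian noise at a sampled level, and discretize that level into one of 10 buckets so an embedding of it can be fed to the model. This is a minimal sketch under assumed conventions; the function name and exact sampling scheme are illustrative, not the paper's implementation.

```python
import numpy as np

def augment_context(frames, rng, max_noise=0.7, num_buckets=10):
    """Add Gaussian noise at a uniformly sampled level to the context frames,
    and return the bucket index of that level so the model can condition on
    how corrupted its context is."""
    level = rng.uniform(0.0, max_noise)
    # Map the continuous level to one of `num_buckets` embedding buckets.
    bucket = min(int(level / max_noise * num_buckets), num_buckets - 1)
    noisy = frames + level * rng.standard_normal(frames.shape)
    return noisy.astype(np.float32), bucket

rng = np.random.default_rng(0)
frames = np.zeros((4, 8, 8, 3), dtype=np.float32)  # 4 context frames
noisy_frames, bucket = augment_context(frames, rng)
```

Training on corrupted context in this way helps the model stay stable when it later conditions on its own imperfect generations during long interactive rollouts.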

Empirical results show that GameNGen is capable of interactively simulating the classic game DOOM at over 20 frames per second on a single TPU. Next-frame prediction using the model yields a Peak Signal-to-Noise Ratio (PSNR) of 29.4, which is on par with the quality of lossy JPEG compression.
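For reference, PSNR is computed from the mean squared error between a reference image and a prediction. The helper below is a standard textbook formulation, not code from the paper:

```python
import numpy as np

def psnr(reference, prediction, max_value=255.0):
    """Peak Signal-to-Noise Ratio in decibels between two images."""
    mse = np.mean((reference.astype(np.float64) - prediction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20 * np.log10(max_value) - 10 * np.log10(mse)

# A uniform pixel error of 10 on an 8-bit scale gives MSE = 100:
ref = np.full((4, 4), 100.0)
pred = np.full((4, 4), 110.0)
print(round(psnr(ref, pred), 2))  # 28.13
```

On this scale, the reported 29.4 dB means the model's next-frame predictions deviate from ground truth by roughly the same magnitude as typical lossy JPEG compression artifacts.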

In conclusion, GameNGen represents a significant leap toward a new era of game engines, one where interactive worlds are autonomously generated by neural models—just as images and videos are today. This work opens the door to fully AI-driven game development, fundamentally transforming how virtual environments are created and experienced.

The demo is available on the project’s GitHub.io page. The paper Diffusion Models Are Real-Time Game Engines is on arXiv.


Author: Hecate He | Editor: Chain Zhang
