Imagine if games were completely live-generated by an AI model: the NPCs and their dialogues, the storyline, and even the game environment. The player's in-game actions would have a real, lasting impact on the game story.
In a very exciting paper, Google researchers just gave us the first credible glimpse of this future.
⚡️ They created GameNGen, the first neural model that can simulate a complex 3D game in real time. They use it to simulate the classic game DOOM, running at over 20 frames per second on a single TPU, with image quality comparable to lossy JPEG compression. And it feels just like the real game!
Here's how they did it (see the sketch below):
1. They trained an RL agent to play DOOM and recorded its gameplay sessions.
2. They used these recordings to train a diffusion model to predict the next frame, conditioned on past frames and player actions.
3. At inference time, they run only 4 denoising steps (instead of the usual dozens) to generate each frame quickly.
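To make the idea concrete, here is a minimal sketch of steps 2 and 3, not the authors' code: a toy denoiser conditioned on past frames, and a sampling loop that runs just a handful of denoising steps per frame. `TinyDenoiser`, `generate_next_frame`, the context length, and the simplified sampler are all illustrative assumptions.

```python
import torch

# Hypothetical stand-in for the diffusion denoiser: a tiny conv net that takes
# the noisy next frame plus the stacked context frames and predicts a cleaner
# frame. (A full diffusion U-Net plays this role in the paper; action and
# timestep conditioning are omitted here for brevity.)
class TinyDenoiser(torch.nn.Module):
    def __init__(self, n_context=16, channels=3):
        super().__init__()
        in_ch = channels * (1 + n_context)
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_ch, 64, 3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, noisy_frame, context_frames, actions, t):
        # Condition on the past by channel-concatenating the context frames.
        # A real model would also embed `actions` and the timestep `t`.
        x = torch.cat([noisy_frame, context_frames.flatten(1, 2)], dim=1)
        return self.net(x)


@torch.no_grad()
def generate_next_frame(model, context_frames, actions, num_steps=4):
    """Sample the next frame with only a few denoising steps.

    Starting from pure noise and running just 4 steps (instead of dozens)
    is what makes per-frame generation fast enough for real-time play.
    The update rule below is deliberately simplified, not a real DDIM/DDPM sampler.
    """
    b, _, c, h, w = context_frames.shape
    frame = torch.randn(b, c, h, w)
    for step in reversed(range(num_steps)):
        t = torch.full((b,), step)
        frame = model(frame, context_frames, actions, t)
    return frame


# Usage: given the last 16 frames and the player's actions, produce frame 17.
ctx = torch.randn(1, 16, 3, 64, 64)          # (batch, context, C, H, W)
acts = torch.zeros(1, 16, dtype=torch.long)  # one discrete action per frame
next_frame = generate_next_frame(TinyDenoiser(), ctx, acts)
```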
Key insights:
- Human players can barely tell the difference between short clips (3 seconds) of the real game and the simulation.
- The model maintains game state (health, ammo, etc.) over long periods despite having only 3 seconds of effective context length.
- They use "noise augmentation" during training to prevent quality degradation in long play sessions.
- The game runs on one TPU at 20 FPS with 4 denoising steps, or 50 FPS with model distillation (at some cost in quality).
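How might that noise augmentation look? Here is a minimal sketch under my reading of the idea, with hypothetical names, not the paper's code: during training, the context frames the model conditions on are corrupted with a random amount of Gaussian noise, and that noise level is returned so it can be fed to the model as an extra input. Since at inference the context consists of the model's own slightly imperfect outputs, training on corrupted contexts teaches it to correct small errors instead of letting them compound over a long session.

```python
import torch

def noise_augment_context(context_frames, max_noise_level=0.7):
    """Hypothetical training-time corruption of the conditioning frames.

    Each sample in the batch gets its own random noise level; the level is
    returned so the model can be told how corrupted its context is.
    """
    b = context_frames.shape[0]
    # Shape (b, 1, 1, 1, 1) so it broadcasts over frames, channels, H and W.
    noise_level = torch.rand(b, 1, 1, 1, 1) * max_noise_level
    noisy = context_frames + noise_level * torch.randn_like(context_frames)
    return noisy, noise_level.flatten()

# During training, the recorded ground-truth context would be corrupted
# before being fed to the denoiser, e.g.:
ctx = torch.randn(4, 16, 3, 64, 64)             # recorded gameplay frames
noisy_ctx, levels = noise_augment_context(ctx)  # condition on noisy_ctx + levels
```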
The researchers did not open-source the code, but I feel like we've just seen a part of the future being written!