Google’s New AI Can Hallucinate An Entire Video Game In Real Time Without A Game Engine
A deep dive into how Google’s ‘GameNGen’ model works, how it is trained, and how it opens up new avenues for the mind-blowingly fantastic future of AI applications.
Imagine a simulated world run by a powerful AI that generates new scenes and experiences for its inhabitants in real-time.
Google’s recent research preprint on arXiv has shocked the world with something similar.
They just introduced GameNGen, the first game engine entirely powered by a neural network.
It is so good that it can generate the entire high-quality environment of the classic video game Doom, which players can interact with in real time.
GameNGen first uses Reinforcement learning to train an AI agent to play the original game, recording its gameplay sessions. It then uses these recordings to train a Diffusion model that produces the game’s frames for human players in real time.
In this story, we take a deep dive into how this AI model works, how it is trained, and how it opens up new avenues for the mind-blowingly fantastic future of AI applications.
Let’s First Learn How Computer Games Work
Computer games are built around a Game loop that performs three basic functions:
Gather user inputs
Update the game state based on these inputs
Render the updated state to the user’s screen pixels
The Game loop runs at such a high frame rate that it creates the illusion of an immersive, interactive world for the user.
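To make this concrete, here is a minimal sketch of such a loop in Python. The gather_input, update_state, and render functions are hypothetical placeholders standing in for the routines a real game would implement:

```python
import time

TARGET_FPS = 60  # run the loop at 60 frames per second

def gather_input():
    """Hypothetical placeholder: read key presses / mouse movements."""
    return {"move_forward": False}

def update_state(state, inputs):
    """Hypothetical placeholder: apply the game's rules to the state."""
    if inputs["move_forward"]:
        state["position"] += 1
    return state

def render(state):
    """Hypothetical placeholder: draw the updated state to screen pixels."""
    print(f"frame at position {state['position']}")

state = {"position": 0}
while True:
    frame_start = time.time()
    inputs = gather_input()              # 1. gather user inputs
    state = update_state(state, inputs)  # 2. update the game state
    render(state)                        # 3. render the updated state
    # sleep for the rest of the frame budget to hold the target frame rate
    time.sleep(max(0.0, 1 / TARGET_FPS - (time.time() - frame_start)))
```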
At the core of the Game loop is a Game engine: a set of programmed rules that the computer’s hardware executes to simulate the game world.
Interestingly, this hardware could come from any machine (and not just a conventional computer).
A natural question arises at this point:
If AI can replace a set of rules for many other non-graphical tasks, why can’t it replace a Game engine?
Why Is It Tough To Replace A Game Engine With AI?
We have hundreds of generative AI models available at our fingertips today.
Many of the notable ones are based on the Diffusion model, with examples being:
DALL-E and Stable Diffusion for Text-to-Image generation
Sora for Text-to-Video generation
Simulating a video game might seem like nothing more than rapid video generation, but this is not the case.
The process involves handling a user's dynamic interactions and using these inputs to change the game state in real time.
This game state change involves working with intricate logic and physics simulations that must be computed on the fly.
On top of this, each frame in a computer game is generated based on the previous frames and actions.
Maintaining long-term consistency of the game state across such an autoregressive process is challenging and prone to instability during generation.
This can be seen clearly in previous research in this area.
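Conceptually, this makes frame generation autoregressive: each new frame is predicted from a window of recent frames and actions, and any error feeds back into future predictions. Below is a rough sketch of that loop; predict_next_frame and get_player_action are hypothetical stand-ins rather than GameNGen’s actual code, and the context length of 8 is an assumed value:

```python
import numpy as np

CONTEXT_LEN = 8  # number of past frames/actions the model conditions on (assumed)

def get_player_action():
    """Hypothetical placeholder: read the player's current input."""
    return "noop"

def predict_next_frame(past_frames, past_actions):
    """Hypothetical stand-in for a generative model conditioned on past
    frames and actions; a real system would run a diffusion model here."""
    return np.zeros_like(past_frames[-1])

frames = [np.zeros((240, 320, 3), dtype=np.uint8)]  # initial frame
actions = []

for step in range(100):
    actions.append(get_player_action())
    # condition only on the most recent CONTEXT_LEN frames and actions
    next_frame = predict_next_frame(frames[-CONTEXT_LEN:], actions[-CONTEXT_LEN:])
    frames.append(next_frame)
    # any error in next_frame feeds back into future predictions,
    # which is why long-horizon consistency is hard to maintain
```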
With GameNGen, researchers have finally solved this, allowing games to be generated just as easily as images and videos are today.
Let’s learn how this is made possible.
An Overview Of The ‘GameNGen’ Model
It’s time for some Reinforcement learning.
Consider an Interactive Environment (E) for a video game like Doom.
It consists of:
Latent states (S): These represent the program’s dynamic memory.
Partial projections of these states (O): These are the rendered screen pixels.
A Projection Function (V: S -> O): This maps states to observations (i.e. the game’s rendering logic).
A set of actions (A): These include key presses and mouse movements.
A Transition Probability Function (p): This controls how the game’s states change based on the player’s inputs/actions.
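Putting these pieces together, the environment E can be thought of as the tuple (S, O, V, A, p). The sketch below expresses that structure in Python; the class, its fields, and the toy game logic are purely illustrative and not taken from the paper:

```python
import random
from dataclasses import dataclass, field

@dataclass
class InteractiveEnvironment:
    """Illustrative sketch of the environment E = (S, O, V, A, p)."""
    # Latent state S: the program's dynamic memory (toy example)
    state: dict = field(default_factory=lambda: {"player_x": 0, "health": 100})
    # Action set A: key presses / mouse movements (toy example)
    actions: tuple = ("noop", "move_forward", "turn_left", "fire")

    def project(self):
        """Projection function V: S -> O, i.e. render the state to screen
        pixels. Here it just returns a summary string instead of real pixels."""
        return f"frame showing player at x={self.state['player_x']}"

    def transition(self, action):
        """Transition function p: produce the next state given the current
        state and the player's action (game logic / physics)."""
        if action == "move_forward":
            self.state["player_x"] += 1
        if random.random() < 0.1:  # some stochastic game event
            self.state["health"] -= 5
        return self.state

# One step of the loop: act, transition the latent state, observe its projection
env = InteractiveEnvironment()
env.transition("move_forward")
print(env.project())
```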