What Are World Models?
World models are neural networks that learn the dynamics of the real world, including physics and spatial properties. They can use input data, including text, images, video, and movement, to generate videos that simulate realistic physical environments.
Key Characteristics
World models have several defining characteristics:
- Physical Understanding: They comprehend physics, gravity, collisions, and material properties
- Spatial Reasoning: They understand 3D space, depth, and object relationships
- Temporal Coherence: They maintain consistency across time in generated sequences
- Multimodal Input: They can process text, images, video, and sensor data
Why World Models Matter
"World models will unlock AI for tangible, real-world experiences, extending generative AI beyond the confines of 2D software." — NVIDIA
Traditional AI systems operate in digital domains—text, images, code. World models bridge the gap to the physical world, enabling:
- Robotics: Robots can "imagine" outcomes before acting
- Autonomous Vehicles: Cars can predict traffic scenarios
- Video Generation: AI can create physically plausible videos
- Game Development: Procedural world generation with realistic physics
Historical Context
The concept of world models has roots in cognitive science and reinforcement learning:
- 1990s: Early work on mental simulation in cognitive science
- 2018: The "World Models" paper by Ha & Schmidhuber introduced a VAE + RNN architecture
- 2022: Text-to-video models such as Imagen Video and Make-A-Video emerged
- 2024: OpenAI's Sora demonstrated video generation as world simulation
- 2025: NVIDIA Cosmos and Google Genie 3 advanced foundation world models
Types of World Models
| Type | Description | Applications |
|---|---|---|
| Prediction Models | Generate future states from current observations | Video synthesis, motion planning |
| Style Transfer Models | Transform inputs while preserving structure | Digital twins, reconstruction |
| Reasoning Models | Analyze and make decisions over time | Robot planning, logistics |
Code Example: Simple World Model Concept
```python
import torch
import torch.nn as nn


class SimpleWorldModel(nn.Module):
    """A conceptual world model architecture."""

    def __init__(self, state_dim, action_dim, latent_dim=256):
        super().__init__()
        # Encoder: compress observations to latent space
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Dynamics model: predict next latent state
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Decoder: reconstruct observations
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, state, action):
        z = self.encoder(state)                                 # encode observation
        z_next = self.dynamics(torch.cat([z, action], dim=-1))  # predict next latent state
        state_pred = self.decoder(z_next)                       # decode back to observation space
        return state_pred, z_next
```
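A model like this can "imagine" trajectories entirely in latent space: encode the current observation once, then step the dynamics model forward under a candidate action sequence, decoding each predicted latent back into an observation. The sketch below illustrates this rollout idea; the `imagine_rollout` function and the small dimensions are illustrative choices, not part of any standard API, and the model definition is repeated in condensed form so the snippet runs standalone.

```python
import torch
import torch.nn as nn


# Condensed copy of the SimpleWorldModel defined above, so this snippet is self-contained.
class SimpleWorldModel(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, state_dim))


def imagine_rollout(model, state, actions):
    """Predict future observations for a sequence of actions.

    state:   (batch, state_dim) current observation
    actions: (batch, horizon, action_dim) candidate action sequence
    returns: (batch, horizon, state_dim) predicted observations
    """
    z = model.encoder(state)  # encode the real observation once
    preds = []
    for t in range(actions.shape[1]):
        # Step the latent dynamics forward under action t
        z = model.dynamics(torch.cat([z, actions[:, t]], dim=-1))
        preds.append(model.decoder(z))  # decode predicted latent to observation space
    return torch.stack(preds, dim=1)


model = SimpleWorldModel(state_dim=8, action_dim=2, latent_dim=32)
state = torch.randn(4, 8)        # batch of 4 current observations
actions = torch.randn(4, 10, 2)  # 10-step candidate action sequences
future = imagine_rollout(model, state, actions)
print(future.shape)  # torch.Size([4, 10, 8])
```

In practice the encoder, dynamics, and decoder are trained jointly, typically with a reconstruction loss on decoded observations plus a prediction loss on the next latent state; a planner or policy can then score many imagined rollouts like these before the agent acts, which is what "imagining outcomes before acting" means for robotics.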
Summary
World models represent a paradigm shift in AI—from pattern recognition to world understanding. They enable AI systems to simulate, predict, and reason about physical environments, opening new possibilities in robotics, autonomous systems, and creative applications.