World Foundation Models (WFMs)

World Foundation Models (WFMs) are a specialized class of world models that meet the scale and generalizability requirements of foundation models. Like GPT for language or CLIP for vision-language, WFMs serve as pretrained base models for physical AI applications.

What Makes a Foundation Model?

Foundation models share key characteristics:

  1. Scale: Trained on massive datasets (petabytes of video/images)
  2. Generalizability: Adaptable to many downstream tasks
  3. Transfer Learning: Knowledge transfers across domains (see the sketch after this list)
  4. Emergent Capabilities: New abilities emerge at scale
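
The transfer-learning property has a standard concrete recipe: freeze a pretrained backbone and train only a small task-specific head. The sketch below shows that recipe in PyTorch; the layer sizes and the 7-dimensional output are invented for illustration and do not correspond to any particular WFM.

import torch
import torch.nn as nn

# Hypothetical pretrained backbone; in practice this would be a large
# video encoder loaded from a checkpoint.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))

# Freeze the pretrained weights so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Small task-specific head, e.g. predicting 7 robot joint targets (toy choice).
head = nn.Linear(256, 7)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# One illustrative training step on dummy data.
features = backbone(torch.randn(32, 512))   # frozen feature extraction
loss = nn.functional.mse_loss(head(features), torch.randn(32, 7))
loss.backward()
optimizer.step()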

Notable World Foundation Models

NVIDIA Cosmos

NVIDIA Cosmos is a family of world foundation models designed for physical AI:

  • Cosmos Tokenizer: Converts video to compact tokens (illustrated in the sketch after this list)
  • Cosmos World Models: Generate physically plausible video
  • Cosmos Transfer: Style and domain transfer capabilities
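
To make "compact tokens" concrete, the sketch below computes the token-grid size for a short clip under assumed compression factors. The 8× temporal and 8×8 spatial figures are one plausible configuration, not a guaranteed Cosmos setting; the released tokenizers document their actual factors.

def token_grid(frames, height, width, ct=8, cs=8):
    """Token-grid shape for a video under temporal compression ct
    and spatial compression cs (assumed 8x8x8 here)."""
    return (frames // ct, height // cs, width // cs)

t, h, w = token_grid(frames=32, height=512, width=512)
print(t * h * w)            # 4 * 64 * 64 = 16,384 tokens
print(32 * 512 * 512 * 3)   # vs. ~25 million raw pixel values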

OpenAI Sora

Sora demonstrates world simulation through video generation:

"Sora is a diffusion model... We find that video models exhibit a number of interesting emergent capabilities when trained at scale, including 3D consistency, long-range coherence, and object permanence."

Google Genie

Genie models enable interactive environment generation (a minimal interaction loop is sketched after the list):

  • Genie 1: Learned to generate playable 2D games from images
  • Genie 2: Extended to 3D environments
  • Genie 3: General-purpose world model for diverse environments
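
What distinguishes the Genie line from one-shot video generation is the per-step interaction loop: each action conditions the next generated frame. The sketch below shows only the loop's shape; ToyGenie and its placeholder dynamics are hypothetical stand-ins, not a published API.

import numpy as np

class ToyGenie:
    """Stand-in for an interactive world model: maps (frame, action)
    to the next frame. A real model would run a learned dynamics net."""
    def reset(self, prompt_image):
        # Start a world from a single image.
        return prompt_image

    def step(self, frame, action):
        # Placeholder dynamics: shift the image by the chosen action.
        return np.roll(frame, shift=action, axis=1)

genie = ToyGenie()
frame = genie.reset(np.zeros((64, 64, 3)))
for _ in range(10):
    action = np.random.choice([-1, 0, 1])   # stand-in for user input
    frame = genie.step(frame, action)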

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                  World Foundation Model                  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Encoder    │  │   Dynamics   │  │   Decoder    │  │
│  │  (Tokenizer) │→ │    Model     │→ │ (Generator)  │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│         ↑                 ↑                 ↓          │
│    Video/Image      Action/Text        Video/Image     │
│      Input          Conditioning         Output        │
└─────────────────────────────────────────────────────────┘
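
The three stages map naturally onto three modules. Below is a minimal PyTorch skeleton of that decomposition; all dimensions and layer choices are invented for illustration, and a real WFM would use a learned video tokenizer and a large transformer dynamics model rather than these toy layers.

import torch
import torch.nn as nn

class TinyWFM(nn.Module):
    """Skeleton of the encoder -> dynamics -> decoder pipeline."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Linear(1024, dim)        # tokenizer stand-in
        self.dynamics = nn.GRU(dim * 2, dim, batch_first=True)
        self.decoder = nn.Linear(dim, 1024)        # generator stand-in

    def forward(self, video_tokens, conditioning):
        z = self.encoder(video_tokens)             # encode observations
        # Dynamics predicts the next latent from state + conditioning.
        out, _ = self.dynamics(torch.cat([z, conditioning], dim=-1))
        return self.decoder(out)                   # decode future frames

model = TinyWFM()
video = torch.randn(2, 16, 1024)      # (batch, time, flattened frame)
cond = torch.randn(2, 16, 256)        # action/text conditioning
future = model(video, cond)           # (2, 16, 1024)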

Post-Training for Specific Tasks

WFMs can be specialized through post-training:

Approach         Description                         Use Case
Unsupervised     Adapt using unlabeled domain data   Domain adaptation
Supervised       Fine-tune with labeled examples     Task-specific behavior
RL Fine-tuning   Optimize for reward signals         Robot policy learning
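
As one concrete instance of the RL row, here is a minimal REINFORCE-style sketch: sampled actions are pushed toward higher reward by weighting their log-probabilities. The action space, shapes, and random reward are toy assumptions standing in for task success computed from the world model's rollouts.

import torch
import torch.nn as nn

policy = nn.Linear(256, 4)                      # 4 discrete actions (toy)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(32, 256)                    # latent states from the WFM
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

# Toy reward; in practice this would score task success in the world
# model's imagined rollouts.
reward = torch.randn(32)

# REINFORCE: raise the log-probability of actions in proportion to reward.
loss = -(dist.log_prob(action) * reward).mean()
loss.backward()
optimizer.step()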

Code Example: Using a Pretrained WFM

from cosmos import CosmosWorldModel, CosmosTokenizer

# NOTE: illustrative API. Class, method, and checkpoint names here are
# assumptions; consult the released Cosmos documentation for the real ones.

# Load the pretrained tokenizer and world model
tokenizer = CosmosTokenizer.from_pretrained("nvidia/cosmos-tokenizer")
world_model = CosmosWorldModel.from_pretrained("nvidia/cosmos-wfm-base")

# Encode input video (`input_video` is a video tensor loaded elsewhere)
video_tokens = tokenizer.encode(input_video)

# Generate future frames conditioned on an action; `encode_action` stands
# in for whatever action/text conditioning encoder the model expects
action_embedding = encode_action("move forward")
future_tokens = world_model.generate(
    video_tokens,
    action=action_embedding,
    num_frames=16,
)

# Decode the predicted tokens back to video
generated_video = tokenizer.decode(future_tokens)
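
Note that generation happens entirely in token space: the dynamics model never touches raw pixels, which is what keeps multi-frame rollout tractable and is why the tokenizer ships as its own component.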

Benefits of WFMs

  1. Reduced Development Time: Start from a pretrained model instead of training from scratch
  2. Data Efficiency: Require less task-specific data
  3. Consistent Quality: Inherit robust physical understanding
  4. Scalability: Benefit from continued pretraining improvements

Summary

World Foundation Models democratize access to physical AI capabilities. By providing pretrained models that understand world dynamics, they enable developers to build robotics, autonomous systems, and simulation applications without collecting petabytes of data or spending millions on compute.
