World Foundation Models (WFMs)

World Foundation Models (WFMs) are a specialized class of world models that meet the scale and generalizability requirements of foundation models. Like GPT for language or CLIP for vision-language, WFMs serve as pretrained base models for physical AI applications.

What Makes a Foundation Model?

Foundation models share key characteristics:

  1. Scale: Trained on massive datasets (petabytes of video/images)
  2. Generalizability: Adaptable to many downstream tasks
  3. Transfer Learning: Knowledge transfers across domains (see the sketch after this list)
  4. Emergent Capabilities: New abilities emerge at scale
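
The transfer-learning property has a standard concrete recipe: freeze a pretrained backbone and train only a small task-specific head. The sketch below shows that recipe in PyTorch; the layer sizes and the 7-dimensional output are invented for illustration and do not correspond to any particular WFM.

import torch
import torch.nn as nn

# Hypothetical pretrained backbone; in practice this would be a large
# video encoder loaded from a checkpoint.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))

# Freeze the pretrained weights so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Small task-specific head, e.g. predicting 7 robot joint targets (toy choice).
head = nn.Linear(256, 7)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# One illustrative training step on dummy data.
features = backbone(torch.randn(32, 512))   # frozen feature extraction
loss = nn.functional.mse_loss(head(features), torch.randn(32, 7))
loss.backward()
optimizer.step()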

Notable World Foundation Models

NVIDIA Cosmos

NVIDIA Cosmos is a family of world foundation models designed for physical AI:

  • Cosmos Tokenizer: Converts video to compact tokens (illustrated in the sketch after this list)
  • Cosmos World Models: Generate physically plausible video
  • Cosmos Transfer: Style and domain transfer capabilities
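
To make "compact tokens" concrete, the sketch below computes the token-grid size for a short clip under assumed compression factors. The 8× temporal and 8×8 spatial figures are one plausible configuration, not a guaranteed Cosmos setting; the released tokenizers document their actual factors.

def token_grid(frames, height, width, ct=8, cs=8):
    """Token-grid shape for a video under temporal compression ct
    and spatial compression cs (assumed 8x8x8 here)."""
    return (frames // ct, height // cs, width // cs)

t, h, w = token_grid(frames=32, height=512, width=512)
print(t * h * w)            # 4 * 64 * 64 = 16,384 tokens
print(32 * 512 * 512 * 3)   # vs. ~25 million raw pixel values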

OpenAI Sora

Sora demonstrates world simulation through video generation:

"Sora is a diffusion model... We find that video models exhibit a number of interesting emergent capabilities when trained at scale, including 3D consistency, long-range coherence, and object permanence."

Google Genie

Genie models enable interactive environment generation (a minimal interaction loop is sketched after the list):

  • Genie 1: Learned to generate playable 2D games from images
  • Genie 2: Extended to 3D environments
  • Genie 3: General-purpose world model for diverse environments
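
What distinguishes the Genie line from one-shot video generation is the per-step interaction loop: each action conditions the next generated frame. The sketch below shows only the loop's shape; ToyGenie and its placeholder dynamics are hypothetical stand-ins, not a published API.

import numpy as np

class ToyGenie:
    """Stand-in for an interactive world model: maps (frame, action)
    to the next frame. A real model would run a learned dynamics net."""
    def reset(self, prompt_image):
        # Start a world from a single image.
        return prompt_image

    def step(self, frame, action):
        # Placeholder dynamics: shift the image by the chosen action.
        return np.roll(frame, shift=action, axis=1)

genie = ToyGenie()
frame = genie.reset(np.zeros((64, 64, 3)))
for _ in range(10):
    action = np.random.choice([-1, 0, 1])   # stand-in for user input
    frame = genie.step(frame, action)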

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                  World Foundation Model                  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Encoder    │  │   Dynamics   │  │   Decoder    │  │
│  │  (Tokenizer) │→ │    Model     │→ │ (Generator)  │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│         ↑                 ↑                 ↓          │
│    Video/Image      Action/Text        Video/Image     │
│      Input          Conditioning         Output        │
└─────────────────────────────────────────────────────────┘
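
The three stages map naturally onto three modules. Below is a minimal PyTorch skeleton of that decomposition; all dimensions and layer choices are invented for illustration, and a real WFM would use a learned video tokenizer and a large transformer dynamics model rather than these toy layers.

import torch
import torch.nn as nn

class TinyWFM(nn.Module):
    """Skeleton of the encoder -> dynamics -> decoder pipeline."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Linear(1024, dim)        # tokenizer stand-in
        self.dynamics = nn.GRU(dim * 2, dim, batch_first=True)
        self.decoder = nn.Linear(dim, 1024)        # generator stand-in

    def forward(self, video_tokens, conditioning):
        z = self.encoder(video_tokens)             # encode observations
        # Dynamics predicts the next latent from state + conditioning.
        out, _ = self.dynamics(torch.cat([z, conditioning], dim=-1))
        return self.decoder(out)                   # decode future frames

model = TinyWFM()
video = torch.randn(2, 16, 1024)      # (batch, time, flattened frame)
cond = torch.randn(2, 16, 256)        # action/text conditioning
future = model(video, cond)           # (2, 16, 1024)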

Post-Training for Specific Tasks

WFMs can be specialized through post-training:

Approach         Description                         Use Case
Unsupervised     Adapt using unlabeled domain data   Domain adaptation
Supervised       Fine-tune with labeled examples     Task-specific behavior
RL Fine-tuning   Optimize for reward signals         Robot policy learning
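
As one concrete instance of the RL row, here is a minimal REINFORCE-style sketch: sampled actions are pushed toward higher reward by weighting their log-probabilities. The action space, shapes, and random reward are toy assumptions standing in for task success computed from the world model's rollouts.

import torch
import torch.nn as nn

policy = nn.Linear(256, 4)                      # 4 discrete actions (toy)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(32, 256)                    # latent states from the WFM
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

# Toy reward; in practice this would score task success in the world
# model's imagined rollouts.
reward = torch.randn(32)

# REINFORCE: raise the log-probability of actions in proportion to reward.
loss = -(dist.log_prob(action) * reward).mean()
loss.backward()
optimizer.step()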

Code Example: Using a Pretrained WFM

from cosmos import CosmosWorldModel, CosmosTokenizer

# NOTE: illustrative API. Class, method, and checkpoint names here are
# assumptions; consult the released Cosmos documentation for the real ones.

# Load the pretrained tokenizer and world model
tokenizer = CosmosTokenizer.from_pretrained("nvidia/cosmos-tokenizer")
world_model = CosmosWorldModel.from_pretrained("nvidia/cosmos-wfm-base")

# Encode input video (`input_video` is a video tensor loaded elsewhere)
video_tokens = tokenizer.encode(input_video)

# Generate future frames conditioned on an action; `encode_action` stands
# in for whatever action/text conditioning encoder the model expects
action_embedding = encode_action("move forward")
future_tokens = world_model.generate(
    video_tokens,
    action=action_embedding,
    num_frames=16,
)

# Decode the predicted tokens back to video
generated_video = tokenizer.decode(future_tokens)
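
Note that generation happens entirely in token space: the dynamics model never touches raw pixels, which is what keeps multi-frame rollout tractable and is why the tokenizer ships as its own component.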

Benefits of WFMs

  1. Reduced Development Time: Start from a pretrained model instead of training from scratch
  2. Data Efficiency: Require less task-specific data
  3. Consistent Quality: Inherit robust physical understanding
  4. Scalability: Benefit from continued pretraining improvements

Summary

World Foundation Models democratize access to physical AI capabilities. By providing pretrained models that understand world dynamics, they enable developers to build robotics, autonomous systems, and simulation applications without collecting petabytes of data or spending millions on compute.
