World Foundation Models (WFMs)
World Foundation Models (WFMs) are a specialized class of world models that meet the scale and generalizability requirements of foundation models. Like GPT for language or CLIP for vision-language, WFMs serve as pretrained base models for physical AI applications.
What Makes a Foundation Model?
Foundation models share key characteristics:
- Scale: Trained on massive datasets (petabytes of video/images)
- Generalizability: Adaptable to many downstream tasks
- Transfer Learning: Knowledge transfers across domains
- Emergent Capabilities: New abilities emerge at scale
Notable World Foundation Models
NVIDIA Cosmos
NVIDIA Cosmos is a family of world foundation models designed for physical AI:
- Cosmos Tokenizer: Converts video to compact tokens (see the compression sketch after this list)
- Cosmos World Models: Generate physically plausible video
- Cosmos Transfer: Style and domain transfer capabilities
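Why tokenization matters becomes clear from a rough size comparison: the world model operates over a far smaller token grid than the raw pixel volume. A minimal sketch in Python, where the compression factors are assumptions chosen for illustration, not published Cosmos Tokenizer specifications:

frames, height, width = 17, 720, 1280      # input clip shape (T, H, W)
temporal_factor, spatial_factor = 8, 16    # assumed downsampling rates
latent_t = frames // temporal_factor + 1   # assumed causal temporal compression
latent_h = height // spatial_factor
latent_w = width // spatial_factor
pixel_positions = frames * height * width
token_positions = latent_t * latent_h * latent_w
print(f"{pixel_positions:,} pixel positions -> {token_positions:,} token positions "
      f"(~{pixel_positions // token_positions}x reduction)")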
OpenAI Sora
Sora demonstrates world simulation through video generation:
"Sora is a diffusion model... We find that video models exhibit a number of interesting emergent capabilities when trained at scale, including 3D consistency, long-range coherence, and object permanence."
Google Genie
Genie models enable interactive, action-controllable environment generation (a minimal interaction loop is sketched after the list below):
- Genie 1: Learned to generate playable 2D games from images
- Genie 2: Extended to 3D environments
- Genie 3: General-purpose world model for diverse environments
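What makes Genie-style models distinctive is the interaction loop: instead of generating a fixed clip, the model is conditioned on a new action at every step. A minimal sketch of that loop, with a dummy transition standing in for the learned dynamics (class and method names are illustrative, not the published Genie API):

import numpy as np

class InteractiveWorldModel:
    def reset(self, prompt_image: np.ndarray) -> np.ndarray:
        # Initialize the world state from a single prompt image
        self.frame = prompt_image
        return self.frame

    def step(self, action: str) -> np.ndarray:
        # A real model would run learned dynamics here; we just shift the
        # frame so the loop runs end to end
        shift = {"left": -1, "right": 1, "jump": 0}.get(action, 0)
        self.frame = np.roll(self.frame, shift, axis=1)
        return self.frame

world = InteractiveWorldModel()
frame = world.reset(np.zeros((64, 64, 3), dtype=np.uint8))  # prompt image
for action in ["left", "left", "jump", "right"]:
    frame = world.step(action)  # the player (or an agent) drives the world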
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│                 World Foundation Model                  │
├─────────────────────────────────────────────────────────┤
│ ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│ │   Encoder    │   │   Dynamics   │   │   Decoder    │  │
│ │ (Tokenizer)  │ → │    Model     │ → │ (Generator)  │  │
│ └──────────────┘   └──────────────┘   └──────────────┘  │
│        ↑                  ↑                   ↓         │
│   Video/Image        Action/Text        Video/Image     │
│      Input          Conditioning           Output       │
└─────────────────────────────────────────────────────────┘
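The diagram reads directly as code: an encoder compresses observations into latents, a dynamics model advances the latent state under action or text conditioning, and a decoder renders the prediction. A minimal PyTorch sketch of that interface, with layer sizes and names chosen only for illustration (not any particular WFM's architecture):

import torch
import torch.nn as nn

class Encoder(nn.Module):            # "tokenizer": pixels -> compact latents
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(64, 16)
    def forward(self, frames):
        return self.net(frames)

class Dynamics(nn.Module):           # predicts the next latent given an action
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16 + 4, 16)
    def forward(self, latent, action):
        return self.net(torch.cat([latent, action], dim=-1))

class Decoder(nn.Module):            # "generator": latents -> pixels
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 64)
    def forward(self, latent):
        return self.net(latent)

# One prediction step: encode an observation, advance it under an action,
# then decode the predicted next observation
enc, dyn, dec = Encoder(), Dynamics(), Decoder()
obs = torch.randn(1, 64)             # stand-in for a (flattened) video frame
act = torch.randn(1, 4)              # stand-in for an action embedding
next_obs = dec(dyn(enc(obs), act))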
Post-Training for Specific Tasks
WFMs can be specialized through post-training; a supervised fine-tuning sketch follows the table:
| Approach | Description | Use Case |
|---|---|---|
| Unsupervised | Adapt using unlabeled domain data | Domain adaptation |
| Supervised | Fine-tune with labeled examples | Task-specific behavior |
| RL Fine-tuning | Optimize for reward signals | Robot policy learning |
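As a concrete instance of the supervised row above, the sketch below adapts a pretrained world model on a small labeled dataset while freezing most of its weights. `world_model`, `task_dataloader`, and the choice to unfreeze only decoder parameters are assumptions for illustration, not a prescribed recipe:

import torch
import torch.nn.functional as F

def post_train(world_model, task_dataloader, max_steps=1000, lr=1e-5):
    # Freeze everything except the decoder head (one common, cheap choice)
    for name, param in world_model.named_parameters():
        param.requires_grad = name.startswith("decoder")
    trainable = [p for p in world_model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    for step, (tokens, actions, target_frames) in enumerate(task_dataloader):
        pred = world_model(tokens, actions)        # assumed calling convention
        loss = F.mse_loss(pred, target_frames)     # supervised reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step >= max_steps:
            break
    return world_model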
Code Example: Using a Pretrained WFM
# Illustrative example: the `cosmos` import path, checkpoint names, and method
# signatures below are schematic rather than a guaranteed public API
from cosmos import CosmosWorldModel, CosmosTokenizer

# Load the pretrained tokenizer and world model
tokenizer = CosmosTokenizer.from_pretrained("nvidia/cosmos-tokenizer")
world_model = CosmosWorldModel.from_pretrained("nvidia/cosmos-wfm-base")

# Encode the input video (`input_video` is assumed to be a video tensor
# loaded earlier in the application)
video_tokens = tokenizer.encode(input_video)

# Generate future frames conditioned on an action; `encode_action` stands in
# for whatever helper maps a text command to an action embedding
action_embedding = encode_action("move forward")
future_tokens = world_model.generate(
    video_tokens,
    action=action_embedding,
    num_frames=16,
)

# Decode the predicted tokens back to video
generated_video = tokenizer.decode(future_tokens)
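Note how the call sequence mirrors the architecture diagram: encode the observation into tokens, roll the dynamics model forward under an action condition, then decode the predicted tokens back into frames.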
Benefits of WFMs
- Reduced Development Time: Start from a pretrained model instead of training from scratch
- Data Efficiency: Require less task-specific data
- Consistent Quality: Inherit robust physical understanding
- Scalability: Benefit from continued pretraining improvements
Summary
World Foundation Models democratize access to physical AI capabilities. By providing pretrained models that understand world dynamics, they enable developers to build robotics, autonomous systems, and simulation applications without collecting petabytes of data or spending millions on compute.