Data Collection for World Models
20 min
Training world models requires massive amounts of high-quality data. This lesson covers strategies for collecting the diverse, multimodal data needed for world model development.
Data Requirements
World models need diverse data types:
| Data Type | Purpose | Scale |
|---|---|---|
| Video | Learn temporal dynamics | Millions of hours |
| Images | Learn spatial relationships | Billions of images |
| Sensor Data | Learn physical measurements | Terabytes |
| Text | Learn semantic grounding | Billions of captions |
| Actions | Learn cause-effect | Millions of trajectories |
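To make the scale of the video row concrete, a back-of-the-envelope storage estimate can help. The sketch below assumes a constant average bitrate; the 2 Mbit/s figure and the function name `video_storage_bytes` are illustrative assumptions, not numbers from this lesson.

```python
def video_storage_bytes(hours: float, bitrate_mbps: float) -> float:
    """Estimate storage for `hours` of video at an average bitrate in Mbit/s."""
    seconds = hours * 3600
    bits = seconds * bitrate_mbps * 1_000_000
    return bits / 8  # 8 bits per byte


# Assumed example: one million hours at an average of 2 Mbit/s.
total = video_storage_bytes(hours=1_000_000, bitrate_mbps=2.0)
print(f"{total / 1e12:.0f} TB")  # 900 TB
```

Even at this modest assumed bitrate, a million hours lands near a petabyte, which is why storage and transcoding choices matter early in the pipeline.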
Data Sources
1. Web-Scale Video
- YouTube, Vimeo, and other video platforms
- Requires filtering for quality and relevance
- Examples: WebVid-10M, HD-VILA-100M
2. Simulation Data
- Physics engines (MuJoCo, Isaac Sim)
- Game engines (Unity, Unreal)
- Advantages: perfect ground-truth labels, fully controllable conditions
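The "perfect labels" advantage can be illustrated without a full physics engine. The sketch below is a toy point-mass simulator in plain Python, not MuJoCo or Isaac Sim; `SimFrame`, `simulate_drop`, and all parameters are hypothetical names chosen for illustration. Because the simulator owns the state, every frame carries exact height and velocity labels, and conditions such as gravity can be changed at will.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimFrame:
    t: float         # simulation time in seconds
    height: float    # exact ground-truth height in metres
    velocity: float  # exact ground-truth vertical velocity in m/s


def simulate_drop(height: float, gravity: float = 9.81,
                  dt: float = 0.01, steps: int = 100) -> List[SimFrame]:
    """Drop a point mass from `height` using semi-implicit Euler steps."""
    frames = []
    h, v = height, 0.0
    for i in range(steps):
        frames.append(SimFrame(t=i * dt, height=h, velocity=v))
        v -= gravity * dt          # update velocity first
        h = max(0.0, h + v * dt)   # then position, clamped at the ground
        if h == 0.0:
            v = 0.0                # stop on contact (no bounce in this toy model)
    return frames


# Controllability: the same scene under Earth-like and Moon-like gravity.
earth = simulate_drop(height=10.0, gravity=9.81)
moon = simulate_drop(height=10.0, gravity=1.62)
```

Each `SimFrame` is a labelled training sample that would be expensive or impossible to annotate this precisely from real footage.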
3. Robot Data
- Real robot demonstrations
- Teleoperation recordings
- Example: Open X-Embodiment dataset
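A demonstration or teleoperation recording is typically stored as a sequence of timestamped observation and action pairs. The sketch below shows one possible in-memory layout; the class and field names are hypothetical and are not the Open X-Embodiment schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    timestamp: float               # seconds since episode start
    observation: Dict[str, float]  # e.g. joint angles; keys are illustrative
    action: List[float]            # command sent to the robot at this step


@dataclass
class Trajectory:
    robot_id: str
    source: str                    # "demonstration" or "teleoperation"
    steps: List[Step] = field(default_factory=list)

    def add_step(self, step: Step) -> None:
        # Enforce increasing timestamps so downstream code can rely on order.
        if self.steps and step.timestamp <= self.steps[-1].timestamp:
            raise ValueError("timestamps must be strictly increasing")
        self.steps.append(step)

    def duration(self) -> float:
        """Elapsed time between the first and last recorded step."""
        if len(self.steps) < 2:
            return 0.0
        return self.steps[-1].timestamp - self.steps[0].timestamp


traj = Trajectory(robot_id="arm-01", source="teleoperation")
traj.add_step(Step(0.0, {"joint_0": 0.10}, [0.0, 0.1]))
traj.add_step(Step(0.5, {"joint_0": 0.15}, [0.0, 0.2]))
```

Keeping the action alongside the observation at each step is what lets a world model learn cause and effect rather than just appearance.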
4. Autonomous Vehicle Data
- Driving recordings with sensors
- LiDAR, camera, radar fusion
- Examples: Waymo Open Dataset, nuScenes
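Fusing LiDAR, camera, and radar requires aligning samples that arrive at different rates. One common approach, sketched here under the assumption that each sensor provides a sorted list of timestamps in seconds, is nearest-timestamp matching with a tolerance. The function names, sample rates, and the 50 ms tolerance are illustrative assumptions.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(timestamps: List[float], target: float) -> int:
    """Index of the timestamp closest to `target` in a sorted, non-empty list."""
    pos = bisect_left(timestamps, target)
    if pos == 0:
        return 0
    if pos == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[pos - 1], timestamps[pos]
    return pos if after - target < target - before else pos - 1


def align_to_reference(reference: List[float], other: List[float],
                       tolerance: float = 0.05) -> List[Optional[int]]:
    """For each reference timestamp, return the index of the closest sample
    in `other`, or None if that sample is further away than `tolerance`."""
    matches: List[Optional[int]] = []
    for t in reference:
        i = nearest_index(other, t)
        matches.append(i if abs(other[i] - t) <= tolerance else None)
    return matches


# Assumed rates: LiDAR at 10 Hz as the reference, camera at roughly 30 Hz.
lidar_ts = [0.0, 0.1, 0.2, 0.5]
camera_ts = [0.0, 0.033, 0.066, 0.1, 0.133, 0.166, 0.2]
print(align_to_reference(lidar_ts, camera_ts))  # [0, 3, 6, None]
```

The `None` entry marks a LiDAR sweep with no camera frame close enough in time; such samples are usually dropped rather than paired with stale data.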
Data Collection Pipeline
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │  →  │  Ingestion  │  →  │   Storage   │
│  (Web, AV,  │     │  Pipeline   │     │  (S3, GCS)  │
│   Robots)   │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
                                               ↓
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Quality   │  ←  │  Metadata   │  ←  │ Transcoding │
│  Filtering  │     │ Extraction  │     │ & Chunking  │
└─────────────┘     └─────────────┘     └─────────────┘
Code Example: Video Collection Pipeline
python
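import asyncio
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VideoMetadata:
    url: str
    duration: float              # seconds
    resolution: Tuple[int, int]  # (width, height) in pixels
    fps: float
    has_motion: bool


class QualityFilter:
    """Assumed minimal filter (not defined in the original lesson code).
    Thresholds follow the Quality Considerations section: 720p, 24 FPS."""

    def passes(self, metadata: VideoMetadata) -> bool:
        _, height = metadata.resolution
        return height >= 720 and metadata.fps >= 24 and metadata.has_motion


class VideoCollector:
    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        self.quality_filter = QualityFilter()

    async def collect_videos(self, urls: List[str]) -> List[dict]:
        # Process all URLs concurrently and drop videos that were filtered out.
        tasks = [self.process_video(url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r is not None]

    async def process_video(self, url: str) -> Optional[dict]:
        # Download video
        video_path = await self.download(url)
        # Extract metadata
        metadata = self.extract_metadata(video_path)
        # Quality filtering
        if not self.quality_filter.passes(metadata):
            return None
        # Chunk into segments
        segments = self.chunk_video(video_path, segment_length=10)
        return {"metadata": metadata, "segments": segments}

    # The three helpers below are called above but not defined in this
    # lesson; the signatures are assumptions and subclasses must supply them.
    async def download(self, url: str) -> str:
        raise NotImplementedError

    def extract_metadata(self, video_path: str) -> VideoMetadata:
        raise NotImplementedError

    def chunk_video(self, video_path: str, segment_length: int) -> List[str]:
        raise NotImplementedError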
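The chunking step in the pipeline code is called with `segment_length=10`, which presumably means ten-second segments. One way to compute the segment boundaries is sketched below; `segment_bounds` is a hypothetical helper, and dropping a short trailing remainder is an assumed policy rather than something this lesson specifies.

```python
from typing import List, Tuple


def segment_bounds(duration: float, segment_length: float = 10.0,
                   min_length: float = 1.0) -> List[Tuple[float, float]]:
    """Split [0, duration] into consecutive (start, end) windows in seconds.

    A trailing remainder shorter than `min_length` is dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + segment_length, duration)
        if end - start >= min_length:
            bounds.append((start, end))
        start += segment_length
    return bounds


print(segment_bounds(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
print(segment_bounds(20.5))  # [(0.0, 10.0), (10.0, 20.0)]
```

The resulting windows could then be handed to a transcoding tool to cut the actual files; fixed-length segments keep batch shapes uniform during training.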
Quality Considerations
Resolution and Frame Rate
- Minimum 720p resolution for training
- At least 24 FPS for smooth motion
- Consistent aspect ratios
Content Quality
- Clear, non-blurry frames
- Stable camera (or intentional motion)
- Good lighting conditions
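Blurry frames can be flagged automatically. The variance of the Laplacian is a widely used sharpness heuristic: sharp images have strong edge responses, so the variance is high, while blurry or flat images score low. The sketch below is a dependency-free version operating on a grayscale image stored as nested lists; in practice an image library would be used instead, and any rejection threshold is dataset-dependent and has to be tuned. The name `laplacian_variance` is hypothetical.

```python
from statistics import pvariance
from typing import List

Gray = List[List[float]]  # grayscale image as rows of pixel intensities


def laplacian_variance(image: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest few sharp edges, which often indicates blur.
    """
    rows, cols = len(image), len(image[0])
    if rows < 3 or cols < 3:
        raise ValueError("image must be at least 3x3")
    responses = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            lap = (image[r - 1][c] + image[r + 1][c]
                   + image[r][c - 1] + image[r][c + 1]
                   - 4 * image[r][c])
            responses.append(lap)
    return pvariance(responses)


# A checkerboard has strong edges; a constant image has none.
sharp = [[255.0 * ((r + c) % 2) for c in range(8)] for r in range(8)]
flat = [[128.0] * 8 for _ in range(8)]
```

Comparing `laplacian_variance(sharp)` with `laplacian_variance(flat)` shows the gap a threshold would sit in: the constant image scores zero and the checkerboard scores far higher.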
Diversity
- Multiple environments
- Various weather/lighting
- Different object types
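Diversity is easier to maintain when it is measured. Assuming each clip carries categorical tags such as environment or weather (the tag names and the 10% threshold here are illustrative assumptions), a simple share count can surface categories that need more collection effort.

```python
from collections import Counter
from typing import Dict, Iterable, List


def underrepresented(clips: Iterable[Dict[str, str]], key: str,
                     min_share: float = 0.1) -> List[str]:
    """Return values of `key` whose share of clips is below `min_share`."""
    counts = Counter(clip[key] for clip in clips)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total < min_share)


clips = (
    [{"environment": "urban", "weather": "clear"}] * 8
    + [{"environment": "urban", "weather": "rain"}] * 1
    + [{"environment": "rural", "weather": "clear"}] * 3
)
print(underrepresented(clips, "weather"))      # ['rain']
print(underrepresented(clips, "environment"))  # []
```

A report like this only covers tags that exist in the metadata, so it complements rather than replaces manual review of what the dataset is missing.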
Legal and Ethical Considerations
When collecting data, consider:
- Copyright: Ensure rights to use content
- Privacy: Remove or blur personal information
- Consent: Obtain permission where required
- Bias: Ensure diverse representation
Summary
Data collection is the foundation of world model development. Success requires diverse sources, robust pipelines, quality filtering, and careful attention to legal and ethical considerations.