Data Collection for World Models

20 min

Training world models requires massive amounts of high-quality data. This lesson covers strategies for collecting the diverse, multimodal data needed for world model development.

Data Requirements

World models need diverse data types:

Data Type      Purpose                        Scale
Video          Learn temporal dynamics        Millions of hours
Images         Learn spatial relationships    Billions of images
Sensor Data    Learn physical measurements    Terabytes
Text           Learn semantic grounding       Billions of captions
Actions        Learn cause-effect             Millions of trajectories

Data Sources

1. Web-Scale Video

  • YouTube, Vimeo, and other video platforms
  • Requires filtering for quality and relevance
  • Examples: WebVid-10M, HD-VILA-100M

2. Simulation Data

  • Physics engines (MuJoCo, Isaac Sim)
  • Game engines (Unity, Unreal)
  • Advantages: exact ground-truth labels and fully controllable conditions
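
To make the "perfect labels, controllable" point concrete, here is a minimal sketch of collecting trajectories from a toy simulator. The 1-D point mass and the names `Transition`, `step`, and `collect_trajectory` are hypothetical stand-ins for a real physics engine, not the API of MuJoCo or Isaac Sim.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, float]  # (position, velocity) of a hypothetical 1-D point mass


@dataclass
class Transition:
    state: State
    action: float      # force applied during this step
    next_state: State  # exact ground truth, read directly from the simulator


def step(state: State, action: float, dt: float = 0.1) -> State:
    """Advance the toy simulator one step (unit mass, semi-implicit Euler)."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def collect_trajectory(num_steps: int, seed: int) -> List[Transition]:
    """Roll out random actions; the seed makes every rollout reproducible."""
    rng = random.Random(seed)
    state: State = (0.0, 0.0)
    trajectory = []
    for _ in range(num_steps):
        action = rng.uniform(-1.0, 1.0)
        next_state = step(state, action)
        trajectory.append(Transition(state, action, next_state))
        state = next_state
    return trajectory
```

Because the simulator exposes its full state, every transition carries an exact label, and re-running with the same seed reproduces the same trajectory. Real-world recordings offer neither property.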

3. Robot Data

  • Real robot demonstrations
  • Teleoperation recordings
  • Example: Open X-Embodiment dataset

4. Autonomous Vehicle Data

  • Driving recordings from sensor-equipped vehicles
  • Fused LiDAR, camera, and radar streams
  • Examples: Waymo Open Dataset, nuScenes
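
Fusing LiDAR, camera, and radar usually starts with matching readings in time, because the sensors sample at different rates. The sketch below shows one simple approach, nearest-timestamp matching with a tolerance. The function name `align_nearest`, the sample timestamps, and the tolerance value are illustrative assumptions. Datasets such as Waymo Open Dataset and nuScenes ship their own synchronization conventions and tooling.

```python
from bisect import bisect_left
from typing import List, Optional


def align_nearest(
    reference_ts: List[float],
    other_ts: List[float],
    tolerance: float,
) -> List[Optional[int]]:
    """For each reference timestamp, return the index of the closest timestamp
    in other_ts, or None if nothing lies within `tolerance` seconds.

    Both lists must be sorted in ascending order.
    """
    matches: List[Optional[int]] = []
    for t in reference_ts:
        pos = bisect_left(other_ts, t)
        # The closest value sits either just before or at the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(other_ts)]
        best = min(candidates, key=lambda i: abs(other_ts[i] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Hypothetical streams: LiDAR sweeps every 0.1 s and more frequent camera frames.
lidar_ts = [0.00, 0.10, 0.20, 0.30]
camera_ts = [0.01, 0.04, 0.08, 0.11, 0.15, 0.19]
pairs = align_nearest(lidar_ts, camera_ts, tolerance=0.02)
```

Sweeps with no camera frame inside the tolerance come back as `None`, so they can be dropped or flagged instead of being paired with a stale image.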

Data Collection Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Sources   │ →  │  Ingestion  │ →  │   Storage   │
│  (Web, AV,  │    │  Pipeline   │    │  (S3, GCS)  │
│   Robots)   │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
                                             ↓
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Quality   │ ←  │  Metadata   │ ←  │  Transcoding│
│   Filtering │    │  Extraction │    │  & Chunking │
└─────────────┘    └─────────────┘    └─────────────┘

Code Example: Video Collection Pipeline

python
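import asyncio
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VideoMetadata:
    url: str
    duration: float              # seconds
    resolution: Tuple[int, int]  # assumed (width, height) in pixels
    fps: float
    has_motion: bool


class QualityFilter:
    """Placeholder filter (not defined in the original snippet).

    Thresholds mirror the Quality Considerations section of this lesson.
    """

    def passes(self, metadata: VideoMetadata) -> bool:
        _, height = metadata.resolution
        return height >= 720 and metadata.fps >= 24 and metadata.has_motion


class VideoCollector:
    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        self.quality_filter = QualityFilter()

    async def collect_videos(self, urls: List[str]) -> List[dict]:
        # Process all URLs concurrently; rejected videos come back as None.
        tasks = [self.process_video(url) for url in urls]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r is not None]

    async def process_video(self, url: str) -> Optional[dict]:
        # Download video
        video_path = await self.download(url)

        # Extract metadata
        metadata = self.extract_metadata(video_path)

        # Quality filtering
        if not self.quality_filter.passes(metadata):
            return None

        # Chunk into segments (segment_length is assumed to be in seconds)
        segments = self.chunk_video(video_path, segment_length=10)

        return {
            "metadata": metadata,
            "segments": segments,
        }

    # The helpers below depend on your download and video tooling,
    # so they are left as stubs here.
    async def download(self, url: str) -> str:
        raise NotImplementedError

    def extract_metadata(self, video_path: str) -> VideoMetadata:
        raise NotImplementedError

    def chunk_video(self, video_path: str, segment_length: int) -> List[str]:
        raise NotImplementedError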
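
The lesson does not define the `chunk_video` step. One plausible reading of `segment_length=10` is fixed-length windows of 10 seconds, and the sketch below computes the boundaries for such windows. The function name `segment_bounds` and the `min_length` policy for the trailing remainder are assumptions made for illustration. The actual cutting would be handed to a video tool.

```python
from typing import List, Tuple


def segment_bounds(
    duration: float,
    segment_length: float = 10.0,
    min_length: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for fixed-length segments.

    A trailing remainder shorter than `min_length` is dropped (an assumed
    policy; padding it or merging it into the previous segment are alternatives).
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    bounds = []
    index = 0
    # Multiply by the index instead of accumulating to avoid float drift.
    while index * segment_length < duration:
        start = index * segment_length
        end = min(start + segment_length, duration)
        if end - start >= min_length:
            bounds.append((start, end))
        index += 1
    return bounds
```

For a 25-second clip this yields three segments, the last one 5 seconds long. A 21-second clip yields two, because the 1-second remainder falls under `min_length`.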

Quality Considerations

Resolution and Frame Rate

  • Minimum 720p for training
  • 24+ FPS for smooth motion
  • Consistent aspect ratios

Content Quality

  • Clear, non-blurry frames
  • Stable camera (or intentional motion)
  • Good lighting conditions
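
The "clear, non-blurry frames" criterion can be checked automatically. A common heuristic is the variance of the Laplacian: sharp edges produce strong responses, so a low variance suggests blur. The sketch below is written in pure Python on a nested list of grayscale values so that it stays self-contained. In practice an image library such as OpenCV or NumPy would do this far faster. The function names and the default threshold are placeholders, not recommended values.

```python
from statistics import pvariance
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior pixels of a
    grayscale frame. Low variance suggests a blurry (or featureless) frame."""
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("frame must be at least 3x3 pixels")
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_sharp(gray: Sequence[Sequence[float]], threshold: float = 100.0) -> bool:
    # The threshold is a placeholder; it has to be tuned per dataset.
    return laplacian_variance(gray) >= threshold
```

Note that a uniformly colored frame scores as "blurry" under this measure, so it is best combined with other checks instead of used alone.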

Diversity

  • Multiple environments
  • Various weather/lighting
  • Different object types
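
Diversity is easier to maintain when it is measured. As a rough sketch, assuming each clip carries metadata tags (the `environment` and `weather` keys and the function name `underrepresented` are hypothetical), the share of each tag combination can be counted and rare combinations flagged:

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def underrepresented(
    clips: Iterable[Dict[str, str]],
    keys: Tuple[str, ...],
    min_share: float,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return (bucket, share) pairs whose share of the dataset is below min_share."""
    counts = Counter(tuple(clip[k] for k in keys) for clip in clips)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        (bucket, n / total) for bucket, n in counts.items() if n / total < min_share
    )


# Hypothetical clip metadata.
clips = (
    [{"environment": "urban", "weather": "clear"}] * 8
    + [{"environment": "urban", "weather": "rain"}]
    + [{"environment": "rural", "weather": "clear"}]
)
rare = underrepresented(clips, keys=("environment", "weather"), min_share=0.2)
```

This only surfaces combinations that are present but rare. Combinations missing from the dataset entirely have to be checked against an explicit list of expected buckets.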

Legal and Ethical Considerations

When collecting data, consider:

  1. Copyright: Ensure you hold the rights or a license to use the content
  2. Privacy: Remove or blur personal information, such as faces and license plates
  3. Consent: Obtain permission from the people recorded where required
  4. Bias: Ensure diverse representation

Summary

Data collection is the foundation of world model development. Success requires diverse sources, robust pipelines, quality filtering, and careful attention to legal and ethical considerations.