Simplicity is the ultimate sophistication. -- Leonardo da Vinci
🔮 Future Receptive: World Model
World Models are like giving a robot an "internal brain simulator": it takes the current visual input and the intended action, then directly predicts and generates a realistic video clip of the next few seconds. This lets the robot "imagine" the outcome in its head instead of relying on blind trial and error with the real hardware every time.
Our world model uses a diffusion Transformer to achieve extremely fine-grained action-to-frame alignment. The generated video quality and physical plausibility significantly outperform previous models.
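To make the action-to-frame alignment concrete, here is a minimal numpy sketch of one common conditioning scheme: each frame's action is embedded and broadcast onto all of that frame's patch tokens before the Transformer blocks. All shapes, names, and the random projection are illustrative assumptions, not the actual model code.

```python
import numpy as np

rng = np.random.default_rng(0)

T, P, D, A = 8, 16, 64, 7   # frames, patch tokens per frame, token dim, action dim

frame_tokens = rng.normal(size=(T, P, D))   # noised video latents (one row of tokens per frame)
actions = rng.normal(size=(T, A))           # one low-dimensional action per frame

W_act = rng.normal(size=(A, D)) * 0.02      # action projection (learned in practice, random here)

# Broadcast each frame's action embedding onto all of that frame's patch
# tokens, so every token "knows" which action produced its frame.
act_emb = actions @ W_act                       # (T, D)
conditioned = frame_tokens + act_emb[:, None, :]  # (T, P, D)

print(conditioned.shape)  # (8, 16, 64)
```

Because the conditioning is applied per frame rather than once per clip, the generator can keep each predicted frame consistent with exactly the action taken at that timestep.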
In real-robot tasks, this "dreaming" improves performance dramatically. The model also supports real-time keyboard/VR control on datasets with virtual arms while keeping inference very fast, which truly lets "imagination" help robots learn skills efficiently.
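The "dreaming" loop above can be sketched as an autoregressive rollout: feed the current frame and a candidate action into the world model, take the predicted next frame as the new input, and repeat, with no real hardware involved. `dummy_world_model` below is a stand-in assumption for the actual diffusion Transformer.

```python
import numpy as np

def dummy_world_model(frame, action):
    """Placeholder predictor: shifts pixel intensities by the action mean.
    The real model would generate the next frame with a diffusion Transformer."""
    return np.clip(frame + action.mean(), 0.0, 1.0)

frame = np.zeros((64, 64, 3))                 # current camera observation
plan = [np.full(7, 0.1) for _ in range(5)]    # candidate 5-step action plan

imagined = []
for action in plan:
    frame = dummy_world_model(frame, action)  # predict entirely "in the head"
    imagined.append(frame)

print(len(imagined))  # 5 imagined future frames
```

In a planning setting, many such candidate plans can be rolled out in imagination and scored, and only the best one is executed on the physical robot.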
The figures below are our world model demos:
Short Trajectory RT-1 World Model Prediction
Short Trajectory RT-1 Ground Truth
Long Trajectory RT-1 World Model Prediction
Long Trajectory RT-1 Ground Truth
Short Trajectory Bridge World Model Prediction
Short Trajectory Bridge Ground Truth
Long Trajectory Bridge World Model Prediction
Long Trajectory Bridge Ground Truth
Short Trajectory Language-Table World Model Prediction
Short Trajectory Language-Table Ground Truth
Long Trajectory Language-Table World Model Prediction