Simplicity is the ultimate sophistication. -- Leonardo da Vinci

đź”® Future Receptive: World Model

World Models are like giving a robot an “internal brain simulator”: given the current visual input and an intended action, the model directly predicts and generates a realistic video clip of the next few seconds. The robot can “imagine” the outcome in its head instead of relying on blind trial-and-error with the real hardware every time.
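The “imagine the next few seconds” loop above can be sketched as an autoregressive rollout: feed the current frame and an action into the model, take the predicted frame as the new input, and repeat. This is a minimal illustration, not the actual model; `world_model` here is a hypothetical stand-in for the learned predictor.

```python
import numpy as np

def rollout(world_model, frame, actions):
    """Autoregressively 'imagine' future frames from the current
    observation and a planned action sequence (hypothetical API)."""
    frames = [frame]
    for a in actions:
        # Predict the next frame from the latest imagined frame + action.
        frames.append(world_model(frames[-1], a))
    return np.stack(frames)

# Toy stand-in model: each action simply shifts pixel values,
# just to exercise the loop end to end.
toy_model = lambda f, a: f + a
video = rollout(toy_model, np.zeros((8, 8)), [np.full((8, 8), 0.1)] * 5)
# video has shape (6, 8, 8): the initial frame plus 5 imagined frames.
```

In the real system the per-step predictor is the diffusion model, but the outer rollout structure is the same.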

Our world model uses a diffusion Transformer to achieve extremely fine-grained action-to-frame alignment. The generated video quality and physical plausibility significantly outperform previous models.
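One way to read “fine-grained action-to-frame alignment” is that the action taken at step t conditions exactly the tokens of frame t, rather than a single pooled action conditioning the whole clip. A minimal numpy sketch of that per-frame conditioning (all shapes and the additive-embedding choice are illustrative assumptions, not the paper's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D, A = 4, 16, 32, 7   # frames, tokens per frame, hidden dim, action dim

frame_tokens = rng.standard_normal((T, N, D))       # video tokens per frame
actions      = rng.standard_normal((T, A))          # one action per frame
W_a          = rng.standard_normal((A, D)) / np.sqrt(A)  # action projection

# Frame-level alignment: embed each action and broadcast it over the tokens
# of its own frame only, before the transformer blocks run.
act_emb     = actions @ W_a                         # (T, D)
conditioned = frame_tokens + act_emb[:, None, :]    # (T, N, D)
```

Every token within a frame receives the same action embedding, while different frames receive different ones, which is what ties each generated frame to its causing action.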

In real robot tasks, this “dreaming” substantially improves policy performance. The model also supports real-time keyboard/VR control on datasets with virtual arms while keeping inference fast, so “imagination” genuinely helps robots learn skills efficiently.
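How “dreaming” can help a policy, in sketch form: score candidate actions by imagining their consequences with the world model and keeping the one whose predicted outcome best matches the goal, so no real hardware is touched during search. The function names, the goal-distance score, and the exhaustive candidate loop are all illustrative assumptions.

```python
import numpy as np

def plan_with_imagination(world_model, frame, candidate_actions, goal, horizon=3):
    """Pick the candidate action whose imagined outcome, rolled out for
    `horizon` steps, lands closest to the goal frame (hypothetical planner)."""
    best, best_err = None, np.inf
    for a in candidate_actions:
        f = frame
        for _ in range(horizon):
            f = world_model(f, a)          # imagined step, no real robot
        err = np.linalg.norm(f - goal)     # distance to the desired outcome
        if err < best_err:
            best, best_err = a, err
    return best

# Toy check: with an additive stand-in model, the action 0.3 repeated for
# 3 steps reaches the goal value 0.9 exactly, so it should be chosen.
toy = lambda f, a: f + a
best = plan_with_imagination(
    toy, np.zeros((4, 4)),
    [np.full((4, 4), 0.1), np.full((4, 4), 0.3)],
    np.full((4, 4), 0.9),
)
```

Real systems typically sample many candidate sequences and use a learned reward or task objective instead of a fixed goal frame, but the imagine-then-score structure is the same.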

The figures below are our world model demos:
Each dataset shows a World Model Prediction video paired with its Ground Truth, for both short and long trajectories:

- RT-1: Short Trajectory (Prediction / Ground Truth); Long Trajectory (Prediction / Ground Truth)
- Bridge: Short Trajectory (Prediction / Ground Truth); Long Trajectory (Prediction / Ground Truth)
- Language-Table: Short Trajectory (Prediction / Ground Truth); Long Trajectory (Prediction / Ground Truth)