Tsinghua Researchers Launch Open-Source Unified World Model, Outperforming Silicon Valley Benchmarks by 40%

A domestically developed, open-source embodied world model has outperformed Pi-0.5, led by Tsinghua University master’s and doctoral students. The model is Motus, a unified world model jointly released by Shengshu Technology and Tsinghua University. The project is primarily led by Bi Hongzhe, a second-year master’s student, and Tan Hengkai, a third-year doctoral student, both from Professor Zhu Jun’s TSAIL laboratory at Tsinghua University.

The term “unified” refers to Motus’s architecture, which integrates five paradigms of embodied intelligence: VLA (Vision-Language-Action), world modeling, video generation, inverse dynamics, and video-action joint prediction. This is the first time a complete “see-think-act” feedback loop has been achieved. In tests across 50 general tasks, Motus’s success rate exceeded that of the internationally recognized Pi-0.5 by more than 35%, with gains of 40% on some tasks! With Motus, robots gain the ability to predict the future outcomes of their actions.

For instance, in the Cloudflare human-verification task, the robot handles irregular mouse trajectories with ease. The video can be viewed here. It clearly shows that the robotic arm controlled by Motus not only accurately locates the mouse but also moves it smoothly and continuously according to its distance from the on-screen click box, ultimately executing a precise click.

In Kongming Chess (peg solitaire), a long-horizon multi-step reasoning task, Motus likewise demonstrates a tight logical loop, resolving the board step by step. The video can be found here.

Another challenge is folding clothes, long considered a nightmare for robotic systems. The video can be viewed here. Clothing is a deformable object whose shape changes continuously throughout the process, yet under Motus’s control the entire operation is smooth and seamless, showing a human-like sense of touch and anticipation.

Motus’s emergence points to a Scaling Law in embodied intelligence, echoing the moment when GPT-2 was recognized as an “unsupervised multitask learner.” Many CTOs and founders who reviewed the results exclaimed “brilliant!” Earlier high-profile works such as NVIDIA’s Cosmos policy and DreamZero were seen as breaking away from the VLA paradigm toward WA (World-Action) or VA (Vision-Action) models, but their core ideas are similar to those of Motus.

Currently, all of Motus’s code and model weights have been open-sourced (links are provided at the end). Let’s delve into how this unified world model was realized.

In the past, the field of embodied intelligence was quite fragmented: VLA, world modeling, video generation, inverse dynamics, and video-action joint prediction each developed along its own track and struggled to form a cohesive whole. The standout feature of Motus is that it encompasses all five paradigms within a single framework.

The technology behind this unification is a Mixture-of-Transformers (MoT) architecture combined with a Tri-model Joint Attention mechanism. Simply put, the approach brings three experts together: through the joint attention layer, these experts exchange information in real time within the same attention operation. This gives the robot a capability akin to human cognition: it can not only see (perceive) but also envision the future outcomes of actions (predict), and thereby decide what to do in the present (act). This is the “see-think-act” loop mentioned earlier.
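
To make the idea concrete, here is a minimal sketch of what such a tri-modal joint-attention block could look like, assuming three expert token streams (perception, predicted future video, and action) with separate projections and feed-forward experts that share a single attention operation. The class and variable names are illustrative assumptions, not code from the Motus release.

```python
# Minimal sketch of a Mixture-of-Transformers block with tri-modal joint attention.
# All names (TriModalJointBlock, ExpertFFN, ...) are illustrative assumptions,
# not the actual Motus implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """Per-modality feed-forward expert (MoT keeps these weights separate)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


class TriModalJointBlock(nn.Module):
    """Modality-specific QKV projections, one shared attention over the
    concatenated sequence, then modality-specific FFNs."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.heads = heads
        # Separate projections and FFNs per expert (perception, video, action).
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(3)])
        self.out = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.ffn = nn.ModuleList([ExpertFFN(dim, 4 * dim) for _ in range(3)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def forward(self, streams):
        # streams: [vision_tokens, video_tokens, action_tokens], each [B, L_i, D].
        B, D = streams[0].shape[0], streams[0].shape[-1]
        lengths = [s.shape[1] for s in streams]
        qs, ks, vs = [], [], []
        for i, s in enumerate(streams):
            q, k, v = self.qkv[i](self.norm1[i](s)).chunk(3, dim=-1)
            qs.append(q)
            ks.append(k)
            vs.append(v)

        def split_heads(t):
            return t.view(B, -1, self.heads, D // self.heads).transpose(1, 2)

        # Joint attention: every token attends to tokens of all three modalities.
        q = split_heads(torch.cat(qs, dim=1))
        k = split_heads(torch.cat(ks, dim=1))
        v = split_heads(torch.cat(vs, dim=1))
        fused = F.scaled_dot_product_attention(q, k, v)
        fused = fused.transpose(1, 2).reshape(B, -1, D)

        # Split back per modality; expert-specific output projection, residual, FFN.
        outs = []
        for i, (s, chunk) in enumerate(zip(streams, fused.split(lengths, dim=1))):
            h = s + self.out[i](chunk)
            outs.append(h + self.ffn[i](self.norm2[i](h)))
        return outs


if __name__ == "__main__":
    block = TriModalJointBlock()
    vision = torch.randn(2, 32, 256)   # perception tokens
    video = torch.randn(2, 64, 256)    # predicted future-frame tokens
    action = torch.randn(2, 16, 256)   # action tokens
    v, f, a = block([vision, video, action])
    print(v.shape, f.shape, a.shape)
```

The key design choice is that attention runs over the concatenation of all three token streams, so the action expert can condition on the imagined future frames within the same layer, while the expert-specific weights keep each modality’s specialization intact.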

However, training such a versatile model takes more than work at the framework level; data is just as big a challenge. Real-world robotic data is expensive and scarce, while the internet offers an abundance of videos that lack action labels. To bridge this gap, Motus employs a strategy called Latent Action.

The research team uses optical flow to capture pixel-level motion trajectories in videos and proposes a Delta Action mechanism that translates these pixel changes into action trends for the robot. The idea is ingenious, akin to teaching a robot martial arts by having it watch experts’ movements in films: although there are no direct action labels (and no real robot data), the robot learns from the motion trajectories (optical flow) of skilled performers.
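
The article does not give the exact formulation, but the general recipe can be sketched as follows: compute dense optical flow between consecutive frames and pool it into a coarse motion vector that can stand in for a missing action label. OpenCV’s Farneback flow, the grid pooling, and the function name delta_action_from_flow are all assumptions made for illustration.

```python
# Minimal sketch of turning video optical flow into a coarse "delta action" signal.
# The pooling scheme and names (delta_action_from_flow, grid_size) are illustrative
# assumptions; the article only states that pixel-level flow is mapped to action trends.
import cv2
import numpy as np


def delta_action_from_flow(prev_frame: np.ndarray, next_frame: np.ndarray,
                           grid_size: int = 4) -> np.ndarray:
    """Compute dense optical flow between two frames and pool it into a small
    grid of (dx, dy) motion vectors that can serve as a latent action label."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) displacement per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w, _ = flow.shape
    # Average the flow inside each cell of a coarse grid -> grid_size*grid_size*2 vector.
    cells = []
    for i in range(grid_size):
        for j in range(grid_size):
            cell = flow[i * h // grid_size:(i + 1) * h // grid_size,
                        j * w // grid_size:(j + 1) * w // grid_size]
            cells.append(cell.mean(axis=(0, 1)))
    return np.concatenate(cells)  # shape: (grid_size * grid_size * 2,)


if __name__ == "__main__":
    cap = cv2.VideoCapture("demo.mp4")  # placeholder path to any video file
    ok1, frame_a = cap.read()
    ok2, frame_b = cap.read()
    if ok1 and ok2:
        latent = delta_action_from_flow(frame_a, frame_b)
        print(latent.shape)  # (32,) for grid_size=4
```

In practice a learned latent-action encoder would replace this hand-crafted pooling, but the sketch shows why unlabeled video becomes usable: the flow itself carries the “what moved where” signal that can supervise an action head.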

Consequently, Motus can digest everything from expensive real-robot data to vast amounts of internet video and human first-person footage (egocentric video), extracting general priors about physical interaction. Building on this data pyramid and the latent actions, Motus adopts a three-stage training process that gradually distills general physical-dynamics knowledge into precise robotic control, as sketched below.
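
The article does not describe what each stage contains, so the following is only a hedged sketch of how a curriculum over that data pyramid might be organized, moving from latent-action supervision on web and egocentric video toward ground-truth robot actions; every stage name, dataset label, and step count below is an assumed placeholder.

```python
# Hedged sketch of a three-stage curriculum over the data pyramid described above.
# Stage boundaries, dataset names, and step counts are assumptions for illustration;
# the article only states that training proceeds in three stages, from broad video
# priors down to precise robot control.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    data_sources: list
    supervision: str   # what the action head is trained against
    steps: int


CURRICULUM = [
    Stage("pretrain_world_prior",
          ["internet_video", "egocentric_video"],
          supervision="latent (delta) actions from optical flow",
          steps=200_000),
    Stage("align_embodiment",
          ["egocentric_video", "cross_embodiment_robot_video"],
          supervision="latent actions plus available proprioception",
          steps=100_000),
    Stage("finetune_control",
          ["teleoperated_robot_trajectories"],
          supervision="ground-truth robot actions",
          steps=50_000),
]


def run(curriculum):
    for stage in curriculum:
        # A real pipeline would rebuild dataloaders and loss weights per stage;
        # here we only print the schedule to show the shape of the curriculum.
        print(f"[{stage.name}] data={stage.data_sources} "
              f"supervision={stage.supervision!r} steps={stage.steps}")


if __name__ == "__main__":
    run(CURRICULUM)
```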

Experimental results indicate that the Scaling Law does operate in the physical world. On the RoboTwin 2.0 simulation benchmark, Motus achieved an average success rate of 88% across 50 general tasks. In the high-difficulty Stack Bowls Three task, where even a small error can topple the stack of bowls, previous baseline models scored below 16%, a sign of severe instability. Motus’s success rate soared to 95%!

More impressive than any single result are the scaling curves. The upper chart shows scaling with data volume, the lower one scaling with the number of tasks; the red line is Motus and the blue line is Pi-0.5. As the number of training tasks increases (horizontal axis), the blue line (Pi-0.5) trends downward: traditional architectures tend to overfit as tasks multiply, learning the new while forgetting the old. The red line (Motus), by contrast, keeps rising, showing that when the architecture is sufficiently unified and the data sources diverse enough, embodied intelligence can develop cross-task generalization just as LLMs do. This mirrors the impact GPT-2 had on NLP by establishing language models as unsupervised multitask learners; Motus now replicates that moment in embodied intelligence.

In real-world testing, whether on the AC-One or the Agilex-Aloha-2 robotic arm, Motus has shown impressive adaptability. The data also show that Motus’s data efficiency is 13.55 times that of its competitors: to reach the same level of performance, it needs roughly 1/13.55 of the data, on the order of 7% of what the others require.

Finally, let us turn to the team behind this unified world model. Motus was released jointly by Shengshu Technology and Tsinghua University, with the two young Tsinghua students introduced earlier, Bi Hongzhe and Tan Hengkai, serving as co-leads, joined by Xie Shenghao, Wang Zeyuan, Huang Shuhe, and Liu Haitian, all from Tsinghua’s TSAIL laboratory (Professor Zhu Jun’s research group). As a co-releasing entity, Shengshu Technology’s decision to open-source Motus reflects its strategic positioning around world models. Those who follow Shengshu Technology will know that the company recently completed a new round of financing and has consistently argued that video-based large models are central to the path to AGI, because video inherently encodes the physical space, causal logic, and dynamic evolution of the real world. Motus is a crucial piece of that strategy, marking robots’ transition from “mechanical execution” to “end-to-end intelligence” and pushing the industry from isolated breakthroughs toward a unified foundation.

This collaboration between academia and industry has catalyzed significant innovation: Shengshu’s extensive experience with multimodal large models, combined with Tsinghua’s top-tier algorithmic expertise, culminated in the Motus unified world model. Motus was fully open-sourced and its research paper released in December 2025, roughly two months ahead of the industry. On the now-popular route of building embodied intelligence on video models, Shengshu Technology and Tsinghua University had already published the Vidar embodied video model in July 2025, leading the industry by about half a year. Motus is now completely open-sourced, and interested readers are encouraged to explore it!

Research paper: here

Project page: here

Open-source repository: here

Model weights: here

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/tsinghua-researchers-launch-open-source-unified-world-model-outperforming-silicon-valley-benchmarks-by-40/
