Advancing Robot Intelligence: Alibaba’s RynnBrain Optimizes Robotic Skills for Real-World Tasks


As we look ahead to 2026, many people wonder whether robots will be able to make dumplings for the Spring Festival Gala. Recent rehearsal reports suggest they will not; robots are more likely to be assigned to carrying trays of dumplings instead. Industry insiders recognize that having a robot make dumplings on its own, without pre-programmed routines or human manipulation, is far harder than simply moving or navigating. The task involves handling dumpling wrappers, deformable objects that are often treated as a kind of Turing test for robotic manipulation, and without a sufficiently intelligent “brain” it is out of reach.

This complexity is why, over the past year, increasing research effort and funding have been directed toward the robot “brain.” A recent initiative from Alibaba’s DAMO Academy, called RynnBrain, targets this very challenge. Unlike projects focused on tasks like folding clothes or preparing breakfast, RynnBrain addresses more fundamental questions. If a robot doing housework is interrupted to receive a package at the door, can it come back and pick up the dishwashing where it left off? If a robot is given an operation that requires multiple tools, will its plan include tools it does not actually have on hand?

These concerns may seem minor in the grand narrative of robotics, and they are often overlooked and lack dedicated benchmarks, yet they are hurdles robots must clear before they can leave the laboratory. In building RynnBrain, DAMO Academy’s embodied intelligence team chose to start from the ground up, integrating spatiotemporal memory and physical reasoning directly into the model. The approach has paid off: the model achieves state-of-the-art (SOTA) performance on 16 embodied benchmarks.

Faced with constraints such as “three loaves of bread and two plates,” the model effectively handles spatial and long-range planning, deriving reasonable distribution schemes and demonstrating its planning and reasoning capabilities under constrained physical conditions. During a sorting task on a cluttered table, the robot can accurately remember completed steps after an interruption and continue executing the task, showcasing its memory and planning abilities amidst multitasking.

Furthermore, DAMO Academy has released the entire RynnBrain series of seven models, including RynnBrain-30B-A3B, the industry’s first mixture-of-experts (MoE) embodied foundation model. It activates only 3 billion parameters at inference yet surpasses the current largest embodied foundation model, Pelican-VL-72B. With this model, robots can retain powerful perception and planning capabilities while responding faster and behaving more smoothly.
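For readers unfamiliar with how a 30B-parameter MoE model can run with only about 3B active parameters, the sketch below shows the basic routing idea in PyTorch. It is a generic top-k mixture-of-experts layer for illustration only; the layer sizes, expert count, and routing details are my assumptions and do not reflect RynnBrain’s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)      # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # (num_tokens, num_experts)
        weights, chosen = gate.topk(self.top_k, dim=-1)     # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Only top_k of num_experts experts run per token, so the parameters actually
# "activated" at inference are a small fraction of the total parameter count.
layer = TopKMoELayer()
tokens = torch.randn(8, 1024)
print(layer(tokens).shape)  # torch.Size([8, 1024])
```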

Currently, a complete set of resources, including model weights, evaluation benchmarks, and full training and inference code, is available to the community. Links to the resources can be found on GitHub, Hugging Face, and the project homepage.
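Since the weights and inference code are open-sourced, getting started should look roughly like any other Hugging Face vision-language model release. The snippet below is only a sketch of that generic pattern: the repository id `Alibaba-DAMO-Academy/RynnBrain-8B`, the prompt, and the auto classes are placeholders I am assuming rather than confirmed identifiers, so consult the project’s GitHub and Hugging Face pages for the actual model ids and recommended loading code.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

# Hypothetical repository id -- replace with the real model id from the release pages.
MODEL_ID = "Alibaba-DAMO-Academy/RynnBrain-8B"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

# A single-image embodied query: ask where a helper object was last seen.
image = Image.open("kitchen_frame.jpg")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Where is the dish cloth? Point to it."}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```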

Integrating large models into robots is not as simple as it may seem. An amusing notion circulating in the industry is that one could simply drop a large model like DeepSeek into a robot, but anyone who has tried knows it is impractical. Models trained on 2D data face a completely different environment once they enter the physical world. In the 2D world they were trained on, top vision-language models (VLMs) can already describe the complete process of making dumplings, because the task is to interpret static images without interacting with the environment. But in the chaotic, cramped kitchen of a New Year’s Eve dinner, a robot relying solely on a VLM’s linguistic and visual experience can quickly find itself at a loss. If the robot rolls out a dumpling wrapper and prepares to seal it but accidentally knocks over a nearby seasoning bottle, it may want to grab a cloth to wipe up the spill, yet it cannot remember where the cloth is, and the task stalls.

Likewise, if it “sees” filling on the table and confidently plans to “use a spoon to scoop the filling,” it may overlook a critical fact: the spoon is not on the table, and the task fails. These scenarios sharply expose the limits of current general-purpose large models. They are “knowledgeable,” but in the physical world they often engage in armchair reasoning, lacking a continuous three-dimensional understanding of space and a grasp of real-world physical interaction. This disconnect produces hallucinated plans that ignore physical constraints. RynnBrain aims to tackle this core issue by systematically introducing spatiotemporal memory and physical reasoning, grounding the cognitive brain in the physical world.

Before RynnBrain, DAMO Academy conducted foundational research called RynnEC, which can be likened to giving large models a set of “eyes.” It can accurately answer questions about objects (attributes, quantity, function, segmentation, and so on) and about space (egocentric perception of the surrounding world plus world-centered perception of scale). For instance, when executing the task “place the tablet on the shelf,” it first considers how wide the tablet is and whether it can sit on the shelf without falling; before reaching for the soy sauce, it estimates the distance to the bottle to determine whether it can reach it without moving.
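The checks described above are essentially small geometric computations layered on top of perception. As a toy illustration (not DAMO Academy’s code), the sketch below performs the two pre-action checks in plain Python: whether the tablet fits on the shelf, and whether the soy sauce bottle is within arm’s reach. All dimensions and thresholds are made-up numbers.

```python
import math

def fits_on_shelf(object_width_m, object_depth_m, shelf_width_m, shelf_depth_m, margin_m=0.02):
    """Check whether an object's footprint fits on a shelf with a small safety margin."""
    return (object_width_m + margin_m <= shelf_width_m and
            object_depth_m + margin_m <= shelf_depth_m)

def within_reach(gripper_xyz, target_xyz, max_reach_m=0.85):
    """Check whether a target point lies inside the arm's reachable radius."""
    distance = math.dist(gripper_xyz, target_xyz)
    return distance <= max_reach_m, distance

# Toy numbers: a 24 cm wide tablet on a 30 cm shelf; a soy sauce bottle ~0.7 m away.
print(fits_on_shelf(0.24, 0.17, 0.30, 0.25))                      # True
reachable, d = within_reach((0.0, 0.0, 0.9), (0.45, 0.50, 0.75))
print(reachable, round(d, 2))                                      # True 0.69
```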

The fine-grained cognitive input provided by these “eyes” serves as a crucial bridge connecting high-level planning with low-level control. RynnBrain not only inherits these abilities but also expands into diverse spatiotemporal memory and physical reasoning capabilities. The introduction of spatiotemporal memory directly addresses the “field of vision” pain points of current embodied large models. Existing brain models often can only solve localization tasks within the current field of view (image). If the target object or key point is out of sight, such as the aforementioned “cloth,” the model becomes ineffective. While there is a common brute-force solution to reprocess all historical images to locate the target, DAMO Academy argues that this approach separates time and space, ignoring the fact that the embodied scene is essentially a continuous, holistic three-dimensional world. Therefore, RynnBrain opts for a more cognitively aligned approach, utilizing historical memory to help models construct a more complete three-dimensional understanding.

This means that a robot’s decision-making and understanding are no longer limited to the immediate scene but can genuinely consider a comprehensive three-dimensional world model. Even amidst complex video changes and disturbances, the model can continuously track and identify used water bottles, demonstrating its long-term memory and understanding of objects in dynamic scenes. After major objects are moved, the robot can still retain memory of their spatial positions and accurately return them, showcasing its stable object and spatial memory capabilities.

So, how is this “human-like” global spatiotemporal recall achieved? The core lies in a “unified representation” encompassing multidimensional information such as space, location, events, and trajectories. In complex embodied interactions, robots encounter highly heterogeneous information. Traditional models often struggle to accommodate this heterogeneity, whereas RynnBrain’s breakthrough is in constructing a unified framework that maps all this information into the model’s output space. This means that the model processes not fragmented visual slices but integrates temporal dimensions, spatial coordinates, and semantic understanding, achieving precise grasp of the physical world at a fundamental level.
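A concrete way to picture what such a unified memory buys you is a store of time-stamped, world-frame observations that can still answer “where was the cloth?” after the cloth has left the camera’s field of view. The sketch below is a minimal illustration of that idea in Python, not RynnBrain’s internal representation; the field names and query logic are assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One remembered sighting of an object, expressed in a fixed world frame."""
    label: str
    position_xyz: tuple          # world-frame coordinates, metres
    timestamp_s: float           # when the object was last seen
    event: str = ""              # optional event context, e.g. "knocked over"

@dataclass
class SpatiotemporalMemory:
    observations: list = field(default_factory=list)

    def record(self, obs: Observation):
        self.observations.append(obs)

    def last_known(self, label: str):
        """Return the most recent sighting of `label`, even if it is out of view now."""
        matches = [o for o in self.observations if o.label == label]
        return max(matches, key=lambda o: o.timestamp_s) if matches else None

memory = SpatiotemporalMemory()
memory.record(Observation("cloth", (1.2, 0.4, 0.9), timestamp_s=12.0))
memory.record(Observation("soy_sauce", (0.3, -0.2, 0.8), timestamp_s=40.5, event="knocked over"))
memory.record(Observation("cloth", (1.5, 0.1, 0.9), timestamp_s=55.0))

cloth = memory.last_known("cloth")
print(cloth.position_xyz, cloth.timestamp_s)   # (1.5, 0.1, 0.9) 55.0
```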

Next, let’s discuss the physical reasoning capability. In traditional VLMs, reasoning primarily occurs at the language level and is not necessarily bound to specific spatial locations or physical states. While a model might generate seemingly perfect plans, like “scoop filling with a spoon,” it may fail to recognize that the spoon is not within reach or where that tool is located. This “decoupling of semantics and space” can lead to physical illusions, resulting in failed tasks. To eliminate this disconnection, RynnBrain employs a reasoning strategy that intertwines “text and spatial localization.” Essentially, the model must “point while speaking.” During the reasoning text generation process, whenever a specific physical object or location is referenced, it must simultaneously predict the corresponding spatial coordinates or area mask. This constraint compels the model to accurately identify the spoon in pixel-level or three-dimensional coordinate systems while generating the instruction to “pick up the spoon.” Through this mechanism, RynnBrain effectively locks the abstract linguistic logic to the concrete physical environment, significantly reducing uncertainty in task execution and ensuring that each decision token is grounded in reality.
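“Pointing while speaking” implies that the model’s output interleaves natural-language plan steps with explicit coordinates or masks that a downstream executor can check against the scene. The snippet below sketches how a consumer of such output might parse the interleaved stream and reject plan steps whose referenced object is not actually present; the `<ref>...</ref><box>...</box>` tag format is an assumed example, not RynnBrain’s actual output schema.

```python
import re

# Assumed interleaved output format: every referenced object carries a bounding box.
EXAMPLE_OUTPUT = (
    "Pick up the <ref>spoon</ref><box>[412,233,468,390]</box> and scoop filling from "
    "the <ref>bowl</ref><box>[120,310,260,450]</box>."
)

GROUNDED = re.compile(r"<ref>(?P<name>[^<]+)</ref><box>\[(?P<box>[\d,\s]+)\]</box>")

def parse_grounded_plan(text):
    """Extract (object name, pixel box) pairs from interleaved model output."""
    steps = []
    for m in GROUNDED.finditer(text):
        box = tuple(int(v) for v in m.group("box").split(","))
        steps.append((m.group("name").strip(), box))
    return steps

def validate_against_scene(steps, detected_objects):
    """Reject a plan if it references objects that perception cannot find in the scene."""
    missing = [name for name, _ in steps if name not in detected_objects]
    return (len(missing) == 0), missing

steps = parse_grounded_plan(EXAMPLE_OUTPUT)
ok, missing = validate_against_scene(steps, detected_objects={"bowl", "filling"})
print(steps)          # [('spoon', (412, 233, 468, 390)), ('bowl', (120, 310, 260, 450))]
print(ok, missing)    # False ['spoon']  -> the plan is rejected before execution
```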

Having discussed the extensive capabilities of RynnBrain, how does it perform in practice? If we merely measure it against existing benchmarks, some of RynnBrain’s capabilities, such as spatiotemporal localization and operational point recognition, are difficult to assess. Current open-source evaluation benchmarks generally lack assessments of fine-grained comprehension and spatiotemporal localization capabilities. To address this gap, DAMO Academy has introduced a new benchmark called RynnBrain Bench, which encompasses four dimensions: object cognition, spatial cognition, object localization, and embodied point prediction, totaling 20 embodied-related tasks. This benchmark, alongside other existing benchmarks, provides a comprehensive assessment of model capabilities.

Facing this demanding “exam,” RynnBrain first demonstrates comprehensive and robust foundation-model capabilities. Its 8B version leads the field in embodied cognition and localization tasks, outperforming advanced models such as Gemini Robotics ER 1.5, Mimo-Embodied, RoboBrain 2.0, Pelican-VL, and Cosmos-reason 2, with gains of over 30% on many specialized abilities.

Moreover, RynnBrain shows no significant loss of generality. Many “embodied brain” models trained specifically for robotic tasks tend to overfit to those tasks and lose the strong capabilities of general large models, such as document understanding and text reasoning. RynnBrain, while reaching SOTA on embodied tasks, retains the general visual capabilities of its base model (Qwen3-VL): it can understand a user’s dietary requirements and use common-sense judgment plus Chinese OCR to pick the right option from multiple labeled items. Its open-source MoE version (RynnBrain-30B-A3B) lets robots keep superior perception and planning while responding faster, activating only 3B parameters at inference yet outperforming the current largest embodied foundation model, Pelican-VL-72B, a case of achieving more with less.

As a foundational model aimed at empowering downstream tasks, RynnBrain has also shown tremendous potential in the post-training phase. Experimental data indicates that its pre-training results significantly boost downstream tasks; for instance, in navigation tasks, merely fine-tuning it as a foundational model (RynnBrain-Nav) yields a 5% improvement compared to models using Qwen3-VL, and it achieves a 2%-3% higher navigation success rate than the current SOTA model, StreamVLN, without any architectural modifications.

In operational planning, RynnBrain exhibits remarkable data efficiency: with just a few hundred fine-tuning samples, the resulting RynnBrain-Plan model develops strong long-range planning capabilities, surpassing Gemini 3 Pro on both in-domain and out-of-domain tasks. This “point-and-understand” quality validates that the interleaved text-and-localization reasoning approach is better suited to the complex, dynamic physical world, and that the model retains strong generalization that allows quicker adaptation to new scenarios.

Ultimately, RynnBrain not only possesses a systematic cognitive architecture but also fills the critical gap between “understanding” and “action,” becoming the first embodied foundational model to support mobile operations. Regarding how to advance the “brain” of robots, the industry currently lacks a standard answer. Researchers at DAMO Academy have mentioned that current explorations roughly split into two approaches: one focuses on actions, directly learning how to operate in the real world, leading to the development of VLA models; the challenge here is the difficulty in obtaining high-quality data, which limits generalization. The other approach aims to leverage the inherent generalization capabilities of large models, seeking to enable the model to understand the world before discussing actions, but accurately aligning this understanding with real, continuous physical space remains an unavoidable challenge.

In this context, DAMO Academy has chosen to solidify foundational capabilities first rather than rush to take sides. RynnEC focuses on establishing perception and understanding of the physical world, while RynnBrain pushes forward spatiotemporal memory, spatial reasoning, and long-range planning. With these foundations in place, RynnBrain can serve as the “brain” for downstream models that carry out real operations, and it has the potential to evolve directly into an operation-capable foundation through post-training. Releasing these capabilities is also meant to encourage the community to keep exploring on the same foundation model rather than each group reinventing the wheel.

Meanwhile, DAMO Academy is concurrently advancing a vision-driven VLA route (such as RynnVLA) and connecting models, data, and real robots into a complete chain through system-level technologies such as RCP, moving from “seeing” to “deciding” to “acting.” Looking further ahead, DAMO Academy has revealed that it is considering a more platform-style solution, aiming to build a more unified embodied-intelligence infrastructure on top of today’s fragmented hardware and algorithm ecosystem. Ultimately, cracking the grand challenge of embodied intelligence will take not just one institution’s effort but the collective evolution of the entire open-source community.

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/advancing-robot-intelligence-alibabas-rynnbrain-optimizes-robotic-skills-for-real-world-tasks/

