Data Scarcity Challenges the Growth of Embodied Intelligence in Robotics


When tasked with cracking a walnut, a robot might smash it against the table as if it were an egg. It may take a robot ten minutes to retrieve a bottle of mineral water from the fridge, and when asked to fold clothes, it meticulously attempts to align and correct its movements, only to crumple the fabric into a ball. After proving themselves as ‘dancers’ and ‘long-distance runners,’ robots are now being asked to perform more ‘practical’ tasks, leading to a series of amusing mishaps.

For robots to be practical, they must interact with a rich physical world, which necessitates a substantial amount of embodied intelligence data for training, explains Zhang Lihua, a distinguished professor at Fudan University and founder of Feijieke Technology (Shanghai) Co., Ltd. “According to incomplete statistics, the global demand for high-quality data in research and development is approximately 1.2 million hours, while the industry currently produces only about 250,000 to 300,000 hours of data each month. The scarcity of high-quality embodied intelligence data has become one of the critical bottlenecks in the development of embodied intelligent robots.”

The year 2026 has been dubbed the ‘year of embodied intelligence data’ by industry insiders, marking a shift in the embodied intelligence robot sector from being algorithm-driven to data-driven, with high-quality data becoming a fundamental strategic resource in the industry.

Severe Data Shortages

In recent years, large language models have made significant advancements by learning to generate language from the vast amounts of textual data available on the internet. Following the same logic, embodied intelligent robots require immense datasets of human motion to learn how to operate in the real world.

For example, the simple action of “picking up dried mushrooms” is effortless for a human but requires the robot to coordinate multiple skills, such as material recognition and spatial posture matching. Achieving reliable and stable execution for such tasks requires billions of high-quality human motion samples as support.

However, unlike children who learn by imitation, robots need structured datasets that include positional coordinates, torque quantification, and tactile feedback annotations. Consequently, although there is an abundance of text and videos on the internet, they cannot be directly fed to robots due to the lack of motion data.

“The text and video data used by large language models are essentially ‘static data from an observer’s perspective,’ while embodied intelligence requires ‘interactive perspective’ data,” Zhang explains. “A suitable grasping action dataset must include not only visual information but also real-time force feedback, tactile perception, and continuous variations in motor torque.” He notes that there are virtually no ready-made ‘multi-modal instruction-action’ datasets available online that can be directly mapped to the robot’s perception and control pathways. “We face not an optimization of data but rather a foundational accumulation from scratch.”

Cai Chen, a product manager at JD Cloud, states that “training a high-quality model requires at least tens of millions of hours of data.” However, mature embodied intelligence datasets on the market today amount to only several hundred thousand hours, insufficient to train high-quality, general-purpose embodied large models.

While tokens are universal in the realm of large language models, data in the field of embodied intelligence is highly dependent on hardware. Due to the constraints of robotic configurations, data cannot easily be reused across different robots, resulting in a fragmentation of collected data that hampers the formation of scale effects.

For instance, two robots of different heights—1.2 meters and 1.8 meters—will have significantly different motion trajectories when grasping objects at the same height, making it challenging to transfer effective data from the 1.2-meter robot to the 1.8-meter model. Cai emphasizes that the inability to maximize the utility of a single dataset is a crucial factor contributing to the shortage of embodied intelligence data.

Additionally, the rapid advancement of robot models has further highlighted the data shortage. Embodied intelligent robots are typically divided into three core components: the “brain,” “cerebellum,” and “body.” The core of the robot’s “brain” consists of large embodied intelligence models. The more complex and intricate the tasks the robot undertakes, the more complex the structure and larger the parameter scale of the embodied intelligence models become.

Currently, the parameter scale of robotic models has increased from millions to hundreds of millions, exacerbating the data shortfall issue. Cong Zheng, a senior researcher at Shanghai New Times Intelligent Equipment Co., Ltd., explains that previously, models with millions of parameters could achieve satisfactory training with relatively little data. Now, complex models with hundreds of millions of parameters require vast amounts of data to ensure proper training and operational stability.

The “Impossible Triangle”

A black mechanical hand securely grips a baby bottle while another mechanical hand scoops the right amount of milk powder, with a doll nearby eagerly awaiting feeding. This scenario is not a scene from an immersive role-playing game but rather a data collection activity taking place at the Beijing Humanoid Robotics Innovation Center data base.

“The data base is the ‘knowledge producer’ for robots. We use real machine teleoperation to produce high-quality data following a series of standardized processes including collection, cleaning, desensitization, inspection, annotation, and quality control,” says Kong Chao, head of data operations at the center. The data base currently has a daily production capacity of 600 hours and has accumulated 40,000 hours of high-quality embodied intelligence data, maintaining a qualification rate of over 95%.

In contrast to the large-scale collection of internet text through web scraping, obtaining high-quality embodied data is labor-intensive and costly. Zhou Mingcai, a deputy researcher at the Institute of Automation of the Chinese Academy of Sciences and head of the embodied operations center at Beijing Zhongke Huiling Robot Technology Co., Ltd., explains that, unlike large language models that handle discrete tokens, embodied intelligent robots require continuous data on joint torque, end effector pose, and tactile feedback. Such millisecond-precision data relies on high-precision physical interaction, creating a high barrier to entry for data collection.

Currently, the main methods for collecting embodied intelligence data include four categories: real machine teleoperation, motion capture, human behavior videos, and simulation-generated data.

Real machine teleoperation involves humans wearing exoskeleton devices or controlling robots to provide “hands-on” teaching. While this method produces high-quality data through strong physical interaction, it is costly, inefficient, and constrained by the robot’s body and the surrounding environment.

Another method involves wearing multiple sensors on a human body for motion capture. This approach is less expensive than real machine teleoperation and more scalable, but it requires human-to-robot motion retargeting because human and robot configurations differ.
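To make the retargeting problem concrete, here is a minimal, purely illustrative sketch: human motions must be rescaled to a robot’s shorter reach and clamped to its tighter joint limits. All function names and numbers below are assumptions for illustration, not any company’s actual pipeline.

```python
# Hypothetical sketch: retarget a captured human wrist trajectory to a
# robot arm with a shorter reach and tighter joint limits.
# Names and numbers are illustrative only.

def retarget_position(human_xyz, human_reach=0.7, robot_reach=0.5):
    """Scale a wrist position from human arm space into robot arm space."""
    scale = robot_reach / human_reach
    return tuple(scale * c for c in human_xyz)

def clamp_joint(angle, low, high):
    """Robot joints rarely match human range of motion, so clamp angles."""
    return max(low, min(high, angle))

# A captured human trajectory (metres), mapped onto the robot:
human_traj = [(0.0, 0.2, 0.6), (0.1, 0.25, 0.55), (0.2, 0.3, 0.5)]
robot_traj = [retarget_position(p) for p in human_traj]
```

Even this toy version shows why retargeted data degrades: uniform scaling and clamping discard exactly the contact forces and fine pose details that fine manipulation depends on.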

Human behavior videos are recorded while individuals perform tasks, capturing specific spatial locations of each action for robots to learn. While this method is low-cost and scalable, it often lacks precise annotations for pose, touch, and torque, making it difficult for robots to learn fine motor skills.

Due to cost considerations, simulation-generated data has also become a significant category of embodied intelligence data. This method resembles playing a video game, with various actions performed in a virtual environment. While this collection method is controllable and scalable, it faces a gap in realism compared to the real world. “Because physics engines struggle to accurately replicate the deformations, friction, and subtle physical properties of real-world objects, simulation data often contains biases, leading to difficulties when applied directly to robots,” Zhou admits.
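A common way such simulators scale up data is to sample physical parameters from assumed ranges for each synthetic episode. The sketch below is a generic illustration of that idea, not any vendor’s simulator; the parameter names and ranges are invented, and the gap Zhou describes arises precisely because real objects need not fall inside these assumed ranges.

```python
import random

# Illustrative sketch: each simulated grasp episode draws physical
# parameters from assumed ranges. Real objects rarely match these ranges
# exactly, which is one source of the sim-to-real bias described above.

def sample_episode_params(rng):
    return {
        "friction":  rng.uniform(0.2, 0.9),    # assumed range, not measured
        "mass_kg":   rng.uniform(0.05, 0.5),
        "stiffness": rng.uniform(100, 1000),   # deformables are hardest
    }

rng = random.Random(0)  # fixed seed: controllable, reproducible generation
episodes = [sample_episode_params(rng) for _ in range(1000)]
```

The appeal is obvious: a thousand episodes cost seconds of compute rather than hours of teleoperation, but every parameter is a guess about the physical world.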

According to Kong Chao, the current embodied intelligence data landscape presents an “impossible triangle,” where high quality, large scale, and low cost cannot be achieved simultaneously.

Zhang Lihua agrees, stating, “The ‘impossible triangle’ is indeed a core contradiction in the industry. While real machine teleoperation yields high-quality data, the need for hundreds of millions of samples to generalize for large models renders a one-to-one collection method inadequate. Conversely, low-cost data from ordinary videos, low-fidelity simulations, or crudely annotated datasets can be easily scaled but often lack physical attributes, actionable motion, and transferability, leading to models that ‘appear capable but perform unstably’ when directly trained.”

The scarcity of embodied intelligence data is not merely a matter of quantity; there is an acute shortage of high-quality, multi-modal, and aligned data that can support complex physical reasoning. “This shortage is essentially a necessary stage in technological evolution. Whichever entity makes breakthroughs in automated data collection, heterogeneous data normalization, and efficient transfer from simulation to reality will gain a competitive advantage in the upcoming landscape,” Zhang asserts.

Diverse Data Fusion and Complementation

In Suqian, Jiangsu, the JD Robot Data Collection Center continuously receives and analyzes video data from courier sorters and supermarket shelf stockers. “The first-person perspective collection terminals worn on their heads can precisely annotate finger positions, bending angles, and other information,” Cai Chen explains. JD plans to collect 10 million hours of video data over the next two years, covering logistics, retail, and household scenarios.

As hardware costs decrease and humanoid robots enter small-scale trial production, the industry increasingly recognizes that solely relying on manual efforts to ‘teach’ robots is unsustainable. The consensus is shifting from “single-source collection” to “multi-source fusion.”

JD Cloud utilizes a full-chain processing approach to achieve ‘one-stop’ transformation and generalization of human behavior videos, simulated data, and real machine operation data, thereby enhancing overall training efficiency. According to Cai, human behavior video data collected at the terminals is fed into an AI data lake platform that, leveraging its petabyte-level processing power, can automatically perform cleaning, alignment, conversion, and pre-annotation, becoming a crucial component of high-quality training data. Simultaneously, simulation models are constructed to generate large batches of high-fidelity synthetic data, while real machine operation data obtained from task execution is also fed back into the platform.
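The “cleaning, alignment, and pre-annotation” stages Cai describes can be pictured as a simple pipeline. The sketch below is a minimal assumption-laden illustration: the record format, function names, and labeling rule are invented for clarity and are not JD Cloud’s actual system, which would use trained models rather than a hand-written rule for pre-annotation.

```python
# Minimal sketch of a "clean -> align -> pre-annotate" pipeline.
# Record format and rules are illustrative assumptions only.

def clean(records):
    """Drop frames with missing sensor readings."""
    return [r for r in records if r.get("pose") is not None]

def align(records):
    """Sort frames by timestamp so modalities line up."""
    return sorted(records, key=lambda r: r["t"])

def pre_annotate(records):
    """Attach a coarse label; a real pipeline would use a model here."""
    for r in records:
        r["label"] = "grasp" if r["gripper_closed"] else "reach"
    return records

raw = [
    {"t": 2, "pose": (0.1, 0.2), "gripper_closed": True},
    {"t": 1, "pose": None,       "gripper_closed": False},
    {"t": 0, "pose": (0.0, 0.1), "gripper_closed": False},
]
processed = pre_annotate(align(clean(raw)))
```

The value of automating these stages at petabyte scale is that raw terminal footage becomes training-ready data without per-frame human labor.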

As model capabilities and video understanding improve, first-person human behavior videos are increasingly used to pre-train robots. “While a large amount of video can train robots to dance and perform, actual operations in factories still rely on genuine real machine teleoperation data, because a robot’s hand position in space and its fine movements cannot be taught through video alone,” Cong Zheng further explains. Screwing in a bolt, for instance, is a delicate action: the bolt may not align perfectly with the screw hole, and humans instinctively know how to adjust force when it tilts slightly, but teaching a robot to do the same requires extensive real machine teleoperation data. This kind of adaptation is the essence of a robot’s generalization capability.

“Currently, the industry predominantly employs a hybrid training strategy. Companies are no longer solely dependent on a single data source but are blending multiple sources of data in specific proportions. This combination ensures both the precision of actions and the generalization capability of the scenarios, making it the most effective means to address the data dilemma,” Zhou Mingcai says.

Zhang Lihua also points out that a single technical route struggles to meet the demands of scale, cost, precision, and generalization. The industry is forming a path of integration that involves “infusing human videos with universal physical knowledge, simulating and synthesizing to cover edge cases, lightweight collection to enhance real interactions, and high-precision teleoperation to fine-tune specific scenarios.”

Kong Chao provides an analogy: “When a child begins to learn something, you don’t need to teach them in detail. Just showing them a lot of things allows them to gradually understand. Then, with some targeted corrections, they can excel.” For companies developing embodied intelligent robots, diverse data fusion and complementation is indeed the most effective approach.

Many companies in the industry are adopting a progressive training pathway that moves from vast amounts of video data toward high-value real machine teleoperation data. Low-cost, large-scale video data first gives robots a foundation for understanding their tasks. High-fidelity simulation models then generate substantial controllable data to familiarize robots with varied scenarios and expand their generalization. Finally, smaller, high-value datasets from real machine teleoperation are used for correction and calibration, enabling robots to execute fine motor skills. In this way, the expensive real machine teleoperation data does not have to shoulder the entire training burden; it serves instead as a crucial anchor for validating model capabilities and correcting deviations.
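The three-stage pathway can be summarized schematically: cheap, abundant data first; scarce, expensive data last. In the sketch below, `train_on` is a stand-in for an actual training step, and the hour figures are illustrative only, roughly echoing the orders of magnitude quoted earlier in the article.

```python
# Schematic of the progressive pathway: large cheap data first, small
# expensive teleoperation data last. "train_on" stands in for a real
# training step; hour figures are illustrative assumptions.

def train_on(model, source, hours):
    model.append((source, hours))  # record what the model was trained on
    return model

model = []
model = train_on(model, "human_video",   1_000_000)  # stage 1: pretraining
model = train_on(model, "simulation",      100_000)  # stage 2: generalization
model = train_on(model, "teleoperation",     1_000)  # stage 3: fine correction
```

The ordering is the point: teleoperation hours are three orders of magnitude scarcer here, so they are reserved for the final calibration rather than the bulk of training.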

Urgent Need for Unified Standards and Protocols

Due to the heavy reliance on data-driven development in the embodied intelligence sector, various companies are competing to enter the data collection arena, showcasing their capabilities. Some focus on upgrading collection equipment, others on continuously iterating physical simulation models, while others invest heavily in multi-configuration real machine teleoperation collection.

High-quality data cannot simply be formed through random collection; it requires a comprehensive, standardized process as a safeguard. The enterprises visited by the reporter have established their own data collection systems; however, differences in data storage formats, metadata structures, and annotation granularity across various companies have made inter-company data circulation nearly impossible, resulting in isolated “data silos.” In this fragmented approach, substantial resources are redundantly invested in similar data collection and technological research, leading to significant waste.

“The most pressing need in the industry today is not merely to add collection equipment or expand simulation scenarios but to establish a set of industry-wide data standards that cover every stage, from collection, generation, annotation, cleaning, training, and evaluation through to feedback,” Zhang Lihua states. The challenge of unifying data standards for embodied intelligence lies in its dynamic nature: the data must be closely coupled with the task, the robot itself, the physical environment, and model capabilities. Without standardized data formats, physical attribute labels, task definitions, and quality evaluation standards, data sharing between companies remains exceedingly difficult.
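What such a shared standard might pin down can be sketched as a record schema: timestamps, joint state, end-effector pose, tactile readings, plus metadata linking each sample to a robot configuration and task definition. The field names below are assumptions for illustration; no such industry-wide schema currently exists, which is exactly the gap Zhang describes.

```python
from dataclasses import dataclass

# Sketch of what a shared embodied-data record might standardize.
# Field names are illustrative assumptions, not an existing standard.

@dataclass
class EmbodiedFrame:
    t_ms: int                      # millisecond timestamp
    joint_torque: list             # N·m per joint
    ee_pose: tuple                 # end effector (x, y, z, qx, qy, qz, qw)
    tactile: list                  # per-sensor pressure readings
    robot_model: str               # which body configuration collected this
    task_id: str                   # standardized task definition
    quality_passed: bool = False   # set by the quality-control stage

frame = EmbodiedFrame(
    t_ms=1200,
    joint_torque=[0.8, 1.2, 0.4],
    ee_pose=(0.3, 0.1, 0.9, 0.0, 0.0, 0.0, 1.0),
    tactile=[0.02, 0.05],
    robot_model="brandA/arm-v1",
    task_id="pick_and_place/bottle",
)
```

Recording `robot_model` and `task_id` alongside every frame is what would let data collected on one body be filtered, retargeted, or re-weighted for another, rather than discarded.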

The decentralization of robotic technology routes presents another significant barrier. Robots of varying configurations differ in degrees of freedom, link lengths, sensor distributions, and accuracy, making it challenging to transfer and utilize collected data. The Beijing Humanoid Robotics Innovation Center data base, for example, has procured 120 robots from seven different brands with varying configurations to accommodate diverse data requirements across different robotic enterprises.

“How to reuse cross-body data is also a question,” Kong Chao further explains. With the vast array of robot types, significant discrepancies in body forms and structural designs exist, ranging from two-finger to five-finger dexterous hands. Data collected for one type of robot is often not applicable to another, and the difficulty in sharing data hinders industry progression. “This is not merely a problem of the data collection industry but rather a result of the diverse developments in the robotics sector. To enhance the circulation of embodied intelligence data, standardization of robot configurations is also necessary.”

In addition to establishing unified data standards, Zhang Lihua believes there is a need to improve the high-fidelity physical representation capabilities of embodied data. “Robots ultimately need to operate in the real world; thus, data must reflect the actual interactions, mechanics, materials, and causal relationships. Moreover, data evaluation is crucial; the industry should not only focus on data scale but also on whether the data genuinely enhances the model’s success rate, robustness, and safety in real tasks.”

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/data-scarcity-challenges-the-growth-of-embodied-intelligence-in-robotics/
