
The trillion-yuan embodied intelligence sector is running up against a hard constraint: data scarcity. Discussions surrounding general artificial intelligence are shifting from text and images to the physical world, highlighting the importance of embodied intelligence: equipping AI with a physical presence that allows it to perceive, understand, and interact with the real environment. This area is emerging as the next critical battlefield in the global technology race.
However, unlike the data-rich landscape of language models, the “brain” models of embodied intelligence are experiencing an unprecedented “data hunger.” Training a robust embodied intelligence brain capable of generalizing across complex, long-sequence tasks requires high-quality, multi-modal, spatiotemporally aligned “human behavior data.” This necessity is driving a systemic revolution in hardware architecture, data collection, and processing paradigms.
According to the Development Research Center of the State Council, it is projected that China’s embodied intelligence market will reach ¥400 billion by 2030 and exceed ¥1 trillion by 2035. Additionally, the China Academy of Information and Communications Technology has included embodied intelligence in its national industrial priorities, predicting a global market size of ¥19.525 billion by 2025.
In the first three months of 2026 alone, the financing scale for China’s embodied intelligence sector approached ¥30 billion, a year-on-year increase of 63%. Companies like Guanglun Intelligent have secured over $500 million in funding, setting a record in domestic financing for this field, while Zhijili Power completed a $200 million Series B funding round, valuing the company at over $1 billion. Xinghai Map also raised ¥2 billion in a Series B+ round. Capital is rapidly flowing into this sector.
However, the path to integrating embodied intelligence into everyday life and industry has not been smooth. Song Jiqiang, Vice President of the Intel Research Institute and Head of Intel China Research Institute, has pointed out that embodied intelligence is at a critical juncture: the industry must raise the ceiling of what robots can do while also guaranteeing a floor of reliability. Many companies showcase their robots’ intelligent capabilities, but few address how to manage their shortcomings, and that gap must be bridged before industrialization is possible.
Companies like Yushu Technology and Galaxy General are producing embodied intelligence “bodies” capable of impressive feats like somersaults and dancing. However, much of this is accomplished through pre-programmed sequences. In other words, the “small brain” of embodied intelligence, the layer responsible for motion control, is already advanced; the focus now needs to shift to the “big brain,” so that robots can execute commands more autonomously and naturally.
Zhu Yanming, co-founder of Jianzhiren, stated that the models of current embodied intelligence companies remain limited to very short, simple tasks such as folding clothes, pouring water, or picking up cups. This reflects a common industry reality: while demonstrations may be impressive, practical applications are still far from realized. These carefully designed tasks are typically conducted in controlled environments, and there is a substantial gap in meeting the complex, variable, long-chain task requirements found in real settings such as homes, factories, and logistics.
Zhu believes that there is still a need for breakthroughs in academic research on embodied models, while the gap in industrialization and commercialization is even larger. The core of this gap lies in the existing models’ lack of deep understanding of the physical world and robust interactive capabilities. The widely endorsed VLP (Vision-Language-Planning) approach, which is based on language models, excels at planning based on textual instructions but results in actions that are essentially just trajectories and behaviors generated from language, lacking the continuous loop of “cognition-action-physical feedback-new cognition” found in the real world.
As a result, there is a growing consensus in the industry to construct a “world model.” The core of this model is to enable AI to understand fundamental physical laws such as friction, rigid body dynamics, and spatial relationships, rather than merely executing language-based trajectory planning. This marks a shift in the development of embodied intelligence from “imitating language logic” to “learning physical laws.”
An interesting trend in this context is the influx of talent from the intelligent driving sector into embodied intelligence. Zhu noted that this migration is not coincidental; there is a deep resonance in the technical stacks (such as Vision-Language-Action models and environmental simulation) and product methodologies between the two fields. More importantly, the “data-driven closed-loop” product iteration architecture honed in intelligent driving, which involves continuously training, testing, and optimizing models with real data, is precisely the engineering capability that embodied intelligence needs to transition from demonstrations to practical applications.
However, whether pursuing theoretical breakthroughs in world models or drawing on engineering experience from intelligent driving, both paths point to a common bottleneck: the extreme scarcity of high-quality training data. If computational power is the engine and algorithms are the blueprint, then data is the fuel. Without appropriate fuel, even the most powerful engines and intricate blueprints cannot drive embodied intelligence toward practical realities.
This has led a number of startups, like Jianzhiren, to pivot away from competing on the models themselves and instead focus on providing a “data foundation” as a more differentiated value infrastructure. Wang Qi, CMO of Tosida’s embodied intelligence business line, has identified three main data pain points. First, there are no unified data standards: different companies’ robot configurations yield data that is difficult to interchange, creating data barriers. Second, collection is difficult and expensive: complex industrial scenarios are hard to instrument, and the high equipment and labor costs are particularly burdensome for small and medium-sized enterprises. Third, privacy and security concerns loom large: companies fear that sharing production-line data could leak core processes, which discourages cooperation on data collection.
Training a powerful embodied intelligence brain, especially a world model, imposes exacting requirements on data. Collection needs to cover three critical dimensions: multi-modality, high precision, and strong causality. Current mainstream data collection solutions fall short on all three. On multi-modality: humans learn through interaction with the world, integrating visual, auditory, tactile, force, and even proprioceptive inputs, and embodied intelligence models need to reconstruct this multi-sensory input.
Zhu emphasized that non-visual modalities, such as touch, play a critical supervisory role or serve as validation and feedback for results. For instance, distinguishing between a two-millimeter and a one-millimeter screw may be difficult visually, but tactile feedback makes the difference clear. However, many current collection schemes rely heavily on visual data, resulting in a lack or poor quality of key modal data like touch and force.
High precision requires that data be tightly aligned in time and space. Temporally, different sensors sample at different frequencies, so ensuring that the tactile signal of “hand touching a cup” strictly corresponds to the contact frame in the video footage is crucial. Spatially, hand movements must be accurately mapped into an absolute coordinate system anchored to the head or the environment. Traditional solutions have inherent flaws: flexible gloves exhibit unstable absolute precision because of fit differences and physical deformation, and purely visual schemes lose tracking when the hand is obscured (such as when reaching into a drawer), interrupting the data.
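To make the temporal-alignment requirement concrete, here is a minimal sketch: resampling a high-rate tactile stream onto video-frame timestamps so that a contact signal lines up with the contact frame. The sample rates (1 kHz tactile, 30 fps video) and the ramp-shaped force signal are illustrative assumptions, not figures from the article.

```python
import numpy as np

def align_tactile_to_video(tactile_t, tactile_v, frame_t):
    """Resample a high-rate tactile stream onto video frame timestamps.

    tactile_t: tactile sample timestamps (seconds)
    tactile_v: tactile readings, same length as tactile_t
    frame_t:   video frame timestamps (seconds)

    Returns one interpolated tactile value per video frame, so the
    "hand touches cup" signal lines up with the contact frame.
    """
    return np.interp(frame_t, tactile_t, tactile_v)

# Hypothetical rates: 1 kHz tactile, 30 fps video, over one second.
tactile_t = np.linspace(0.0, 1.0, 1000)
tactile_v = np.clip(tactile_t * 10 - 5, 0, None)  # contact force ramps up after t = 0.5
frame_t = np.linspace(0.0, 1.0, 30)
per_frame_force = align_tactile_to_video(tactile_t, tactile_v, frame_t)
```

Real systems must also correct for per-sensor clock offsets before interpolating; this sketch assumes the timestamps already share one clock.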
Zhu indicated that these precision drift and occlusion issues are significant reasons why solutions become “unusable” in home or industrial scenarios; low-quality data can even inject “physical illusions” into the model. Strong causality means that the data ultimately used for training must consist of complete, interpretable “action chains.” It should not only document “what was done” (action sequences) but also include “why it was done” (cognition and decision-making) and “what the outcome was” (physical feedback). For example, the data needs to record a complete loop such as “seeing the cup (visual) – deciding to take it (cognition) – moving the arm and adjusting finger posture (action) – feeling the weight and sliding trend of the cup (tactile/force feedback) – fine-tuning grip strength (adjustment) – successfully picking it up (result).” Traditional collection methods can only capture actions and partial visuals, leading to a broken causality chain. Relying on extensive manual annotation later is prohibitively costly and difficult to scale.
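The “complete loop” described above can be sketched as a data record whose fields mirror each link of the causal chain, with a validity check that rejects samples missing any link. All field names are illustrative assumptions, not Jianzhiren’s actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionChainRecord:
    """One causally closed training sample: what was seen, decided,
    done, felt, adjusted, and achieved. Field names are illustrative."""
    perception: str         # e.g. "cup visible, handle facing left"
    decision: str           # e.g. "pick up the cup"
    actions: List[str]      # motor commands / trajectory segments
    feedback: List[str]     # tactile / force events during execution
    adjustments: List[str]  # corrections triggered by feedback
    outcome: str            # e.g. "cup lifted"

    def is_closed(self) -> bool:
        # A usable sample documents the full loop, not just the actions.
        return bool(self.perception and self.decision and self.actions
                    and self.feedback and self.outcome)

record = ActionChainRecord(
    perception="cup visible, handle facing left",
    decision="pick up the cup",
    actions=["move arm to pre-grasp", "close fingers"],
    feedback=["contact on fingertips", "slip detected"],
    adjustments=["increase grip force"],
    outcome="cup lifted",
)
```

A record that captures only `actions` and partial visuals, as traditional collection does, would fail `is_closed()` and be useless for causal training.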
Jianzhiren estimates that it needs to process more than 20,000 hours of data every week; relying solely on manual annotation would require a labeling team of nearly 5,000 people, an obviously unrealistic proposition. Existing collection technologies clearly cannot produce such data efficiently and accurately.
While embodied intelligence hardware is advancing rapidly, the data bottleneck has become the tightest constraint on how fast embodied intelligence brains can evolve, and traditional solutions cannot meet the new demands. Measured against the requirements of model training, data collection technology must undergo a profound paradigm shift: flexible wearable devices lack precision, visual collection is easily occluded, multi-modal data is hard to align, and collection efficiency is low, all of which severely limit data quality and scale.
To address these issues, the hardware architecture and software processes of data collection must be rebuilt into a high-precision, multi-modal, efficient, and low-cost collection system. On the hardware side, the mainstream approach to capturing hand posture combines flexible gloves with inertial measurement units (IMUs), which estimate joint angles algorithmically but carry inherent errors because the material deforms. Evolving these flexible devices into rigid systems that mirror the human skeletal structure physically eliminates the errors caused by flexible deformation.
Zhu explained that they utilize a bionic design, employing a rigid connection to directly measure joint relative displacements, fundamentally solving precision issues stemming from flexible structures. The core hardware design of Jianzhiren’s Gen DAS Dex incorporates an exoskeletal structure with full-degree-of-freedom coverage, leveraging self-developed magnetic encoders for ultra-high-precision detection while keeping the design lightweight, with the magnetic encoders reduced to just 3mm. The overall volume is comparable to standard ski gloves, making the device easy to wear without impeding normal operation; at only 210g, it can faithfully capture hand movements even during complex operations and precise grasping, without encumbering the wearer.
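The payoff of measuring joint relative displacements directly is that fingertip pose follows from straightforward forward kinematics rather than error-prone estimation through deforming fabric. Below is a 2-D, three-segment sketch of that chaining; the segment lengths and angles are chosen purely for illustration.

```python
import math

def fingertip_pose_2d(segment_lengths, joint_angles_rad):
    """Forward kinematics for a planar finger: chain per-joint relative
    angles (as a rigid exoskeleton's encoders would report them) into
    the fingertip position. A 2-D simplification for illustration."""
    x = y = 0.0
    theta = 0.0
    for length, angle in zip(segment_lengths, joint_angles_rad):
        theta += angle              # relative angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return x, y

# Three phalanges (lengths in meters), each bent 30 degrees
# relative to the previous segment.
tip = fingertip_pose_2d([0.04, 0.03, 0.02], [math.radians(30)] * 3)
```

A real hand requires a 3-D kinematic chain per finger, but the principle is identical: rigid links make the per-joint measurement exact, so the chained pose inherits that precision.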
Furthermore, to enhance precision and stability, the team has implemented real-time calibration and compensation mechanisms. Each encoder and rigid angle detection phase achieves real-time calibration on the edge; simultaneously, built-in absolute temperature detection in the encoder compensates for drift caused by temperature changes. Additionally, fusion of vibration, tactile, and visual feedback minimizes overall drift to near-zero levels, ensuring data precision and stability across different hand shapes, scenes, and environments.
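As a rough illustration of the temperature-compensation idea, here is a linear drift model: subtract a per-degree drift term from the raw encoder reading. The drift coefficient and reference temperature below are placeholder assumptions, not measured properties of Jianzhiren’s encoders.

```python
def compensate_encoder(raw_angle_deg, temp_c, ref_temp_c=25.0,
                       drift_deg_per_c=0.002):
    """Subtract a linear temperature-induced drift from a raw magnetic
    encoder reading. The coefficient is an illustrative placeholder;
    real devices would use a per-unit calibration curve."""
    return raw_angle_deg - drift_deg_per_c * (temp_c - ref_temp_c)

# Reading taken 20 degrees C above the reference temperature.
corrected = compensate_encoder(45.10, temp_c=45.0)
```

In practice drift is rarely perfectly linear, which is why the article describes on-device calibration plus fusion with vibration, tactile, and visual feedback rather than a single fixed formula.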
In terms of tactile sensing, they have abandoned low-resolution solutions in favor of developing high-resolution magnetic tactile sensors. The goal is not only to perceive “contact or no contact” but also to achieve a three-dimensional force perception array (normal and tangential forces) to capture rich information such as sliding, texture, and hardness. Zhu explained that this enables the model to learn critical state information such as “micro-strain,” which is essential for understanding physical laws like friction.
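One way a three-axis force array supports the “micro-strain” and friction learning described above is slip detection: when the tangential-to-normal force ratio approaches the static friction coefficient, the grasped object is about to slide. A minimal sketch, with an assumed friction coefficient:

```python
import math

def slip_risk(fx, fy, fz, mu=0.5):
    """Return True when the tangential/normal force ratio approaches
    the static friction coefficient mu, i.e. the grasped object is
    about to slide. fx, fy are tangential components, fz is normal.
    mu = 0.5 is an assumed value, not a measured one."""
    if fz <= 0:          # no normal force means no contact to slip
        return False
    tangential = math.hypot(fx, fy)
    return tangential / fz >= mu
```

A binary contact sensor cannot compute this ratio at all, which is the article’s point: normal and tangential components must be sensed separately for the model to learn friction-related state.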
To address the critical issue of visual occlusion, Jianzhiren has designed a “side-end positioning + head-hand coordination” solution. An IMU is integrated into the back of the hand and an independent camera is mounted beneath it; single-hand SLAM (simultaneous localization and mapping), combined with the relative positions of head and hand, restores the spatial-temporal coordinates. This approach maintains good continuous positioning during brief or partial occlusions (such as reaching into a drawer, or one hand blocking the other), keeping positioning drift at millimeter levels and ensuring uninterrupted data collection.
On a more fundamental level, they have achieved strict clock synchronization of hardware through self-developed System-on-Chip (SoC) and communication protocols, controlling communication delays between multiple devices to under 1 millisecond. On the software side, they utilize high-confidence events, such as “tactile contact,” as “ground truth” to dynamically calibrate and causally align multi-modal data, forming an “end-side dual-loop dynamic calibration” mechanism to ensure long-term data collection precision remains stable.
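The idea of using high-confidence events such as tactile contact as “ground truth” can be sketched as estimating a constant clock offset between two sensor streams from matched event timestamps. This is a deliberate simplification of the dual-loop calibration described here; the timestamps are invented for illustration.

```python
import numpy as np

def estimate_clock_offset(contact_times_a, contact_times_b):
    """Estimate the constant clock offset between two sensor streams
    from matched high-confidence contact events. The median makes the
    estimate robust to an occasional mismatched event pair."""
    a = np.asarray(contact_times_a)
    b = np.asarray(contact_times_b)
    return float(np.median(b - a))

# Hypothetical data: stream B's clock runs 3 ms ahead of stream A's,
# observed at three shared contact events.
offset = estimate_clock_offset([1.000, 2.500, 4.100],
                               [1.003, 2.503, 4.103])
```

Once estimated, the offset is subtracted from one stream before alignment; a production system would re-estimate it continuously, since clocks drift over long collection sessions.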
Data collection is just the first step; transforming raw data into usable “human data” for models presents an even greater challenge. Zhu shared Jianzhiren’s solution: an end-to-end processing model whose input is raw multi-modal data streams and whose output is standardized data packets that are spatiotemporally aligned, causally closed, and semantically annotated with chain-of-thought (CoT) explanations. This system has led to exponential efficiency improvements: first, real-time quality checks at the collection stage filter out ineffective actions; second, their proprietary compression algorithm reduces multi-channel video stream data to just 2% of its original size without losing critical information; finally, through streaming transmission and cloud-based data foundation model automation, what previously required thousands of person-years of annotation work can now be managed by a small team. This capability enables large-scale, diverse data collection.
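The first stage, filtering out ineffective actions at collection time, might look like a simple rule-based gate over incoming clips. The thresholds and clip fields below are illustrative assumptions, not Jianzhiren’s actual criteria.

```python
def filter_effective_clips(clips, min_duration_s=0.5, require_contact=True):
    """First-stage quality check: drop clips that are too short or that
    never register tactile contact. Thresholds are illustrative."""
    kept = []
    for clip in clips:
        if clip["duration_s"] < min_duration_s:
            continue  # fragment too short to contain a complete action
        if require_contact and not clip["has_contact"]:
            continue  # hand never touched anything: no causal chain
        kept.append(clip)
    return kept

# Hypothetical incoming clips from one collection session.
clips = [
    {"id": 1, "duration_s": 3.2, "has_contact": True},   # valid grasp
    {"id": 2, "duration_s": 0.2, "has_contact": True},   # too short
    {"id": 3, "duration_s": 5.0, "has_contact": False},  # no contact event
]
kept = filter_effective_clips(clips)
```

Rejecting junk this early is what makes the later, more expensive stages (compression, cloud-model annotation) affordable at 20,000-hours-per-week scale.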
As the industry collectively recognizes that “world models” require data to thrive, a deep innovation centered around data foundations is already underway. From rigid bionic hardware to edge-side intelligent integration, and to automation powered by data foundation models, these systematic breakthroughs are attempting to answer a fundamental question: how to accurately record human experiences in the physical world for training robots. This “data foundation” revolution is quietly laying the groundwork for embodied intelligence to integrate into the physical world. The entity that masters the efficient production of “human data” may hold the key to unlocking the era of general embodied intelligence.
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/the-data-dilemma-in-the-trillion-dollar-embodied-intelligence-race/
