
In the past year, embodied intelligence has taken center stage in the industry. On one hand, there has been a surge in funding and viral demonstration videos showcasing robots unlocking increasingly complex movements. On the other hand, there are significant challenges, such as limited real-world applications, stability issues, and unresolved core problems related to cost and safety.
Despite the unprecedented excitement surrounding the field, critical questions are beginning to surface: What stage of technological development is embodied intelligence currently in? How far are we from widespread use and large-scale deployment?
During a roundtable discussion at the Force Brain Technology Open Day held on February 10, participants from academia, research institutions, and industry were hesitant to offer overly optimistic conclusions. The consensus seemed to be that we are still far from the “ChatGPT moment” for embodied intelligence. Whether it is model capabilities, hardware maturity, or the systems for data, evaluation, and standards, embodied intelligence remains in a phase of high uncertainty. The paths of model development are still diverging, and the systemic issues revealed in real-world deployments are more complex and challenging than those encountered in simulated environments.
“We are still far from the ‘ChatGPT moment’ for embodied intelligence,” stated Wang Zhongyuan, director of the Beijing Academy of Artificial Intelligence. He acknowledged the current hype around the field but also highlighted significant concerns, analyzing the imbalance from both hardware and model perspectives: on one hand, advances in hardware capability are evident, with robots progressing from walking to running and performing tasks; on the other hand, issues of continuous-operation stability, safety, and battery life remain unresolved.
Wang also pointed out that when the models and hardware of embodied intelligence are deployed in real-world scenarios, the industry is confronted with the significant gap that remains before the desired large-scale applications can be achieved. On the model side, he remains cautious, emphasizing that both modular “VLM (Vision-Language Model) + Control” solutions and end-to-end VLA (Vision-Language-Action) models are still exploratory. “At this stage, it is premature to claim that embodied intelligence has reached a fundamental breakthrough,” he said. The more realistic path may not be solving generalization in a single stroke, but rather first accomplishing tasks in real scenarios, gathering more data, building a data feedback loop, and then tackling the generalization challenge.
From the hardware perspective, Professor Wang Yu from Tsinghua University’s Department of Electronic Engineering believes that the capabilities currently demonstrated by robots are mostly limited to specific tasks. When tasks extend in duration or complexity, they require coordination between various cognitive processes. In real environments, the complexity increases significantly. Wang illustrated this with an example: transitioning from folding a single piece of clothing to cleaning an entire room involves not just executing a single action but also perceiving the overall environment, establishing task goals, and continuously completing cross-modal, multi-step tasks.
Wang posed a fundamental question: if robots are to enter human living spaces, does the environment itself need to change? Physical spaces, he argued, are currently designed entirely around human needs, so it is unreasonable to expect robots to perceive and adapt with fully human-level perception. He suggested that, much like the vehicle-road coordination approach in autonomous driving, transforming the physical environment could offer an alternative pathway for the continued advancement of machine intelligence.
However, the industry has yet to reach a consensus on what constitutes the “ChatGPT moment” for embodied intelligence. Jiang Daxin, founder and CEO of Jumping Star, argued that this moment is marked by zero-shot generalization: models that can understand instructions and complete tasks in scenarios they have never encountered. “Compared to natural language, I believe the ‘ChatGPT moment’ for embodied intelligence will be even more challenging,” he noted. Generalization in embodied intelligence is not a single dimension but spans multiple levels, including scenario, task, and goal, and the many possible combinations of these dimensions are precisely why the industry cannot agree on how to define the “ChatGPT moment” in this context.
From a technical standpoint, Jiang recounted the evolution of natural language processing (NLP) before and after the advent of the Transformer model architecture. He believes that NLP’s rapid progression was largely due to solving self-supervised pre-training issues, which facilitated the compression of vast amounts of internet knowledge and enabled the execution of complex tasks. In embodied intelligence, however, the industry has yet to establish a unified understanding of fundamental issues such as visual encoding methods and 3D spatial reasoning mechanisms. He suggested that the industry may need to wait for breakthroughs in these areas before truly reaching the “ChatGPT moment.”
Gao Jiyang, founder and CEO of Star Sea Map, offered a more industry-oriented perspective, pointing out that embodied intelligence and large language models differ fundamentally in industrial form. For large language models, the scarcity lies primarily in the model itself: if the model is strong, the entire commercialization and industrialization chain is effectively in place. Embodied intelligence, by contrast, has a longer chain — an immature component supply chain, complete machines that have not reached scale, and sales channels and end markets that remain heavily offline — which means algorithms alone cannot create a turning point.
Given these conditions, Gao prefers to interpret the “ChatGPT moment” for embodied intelligence as a time when it possesses commercial value within defined parameters. He anticipates that as complete machines, supply chains, data, and models gradually come together over the next two years, 2026 could mark a pivotal juncture. “2026 is likely to be the year to form an ‘application closed loop.’ In the first half of 2025, we will see that embodied intelligence is still in a nascent exploratory stage; by the second half, its development speed will significantly accelerate. 2026 could be the year of a technological explosion, which will drive certain application areas to create spillover effects and synergize with the supply chain and complete products,” Gao stated.
Tang Wenbin, co-founder and CEO of Force Brain, further lowered the threshold for this moment. He views Jiang’s definition of the “ChatGPT moment” as being closer to achieving AGI (Artificial General Intelligence) goals. Tang emphasized that the core of the embodied intelligence “ChatGPT moment” is to complete closed loops in defined scenarios, calculate ROI (Return on Investment), and achieve large-scale application deployment. “What was the biggest shock that ChatGPT brought us? We once viewed it as a toy, but at that moment (the ChatGPT moment), we recognized it as a tool, something usable,” he remarked. In Tang’s view, when robots transition from being toys to tools, that moment embodies the significance of the “ChatGPT moment.”
While judgments about the ultimate form of embodied intelligence remain diverse, a consensus is forming within the industry on the next steps: real machines, evaluations, and standards. Tang acknowledged that the current challenges in embodied intelligence are not isolated capability gaps but stem from the lack of a comprehensive technological framework. “Whether in terms of data or hardware, we are lacking many elements across training, inference, and the entire chain, and evaluations are also deficient,” he stated.
He argued that without the ability to evaluate real capabilities, models cannot truly evolve, and existing industry leaderboards are insufficient in scale. “Does a score of ninety-nine point something on the leaderboard truly represent current real capabilities? Clearly not, so we need large-scale, real-machine assessments grounded in the physical world to guide us forward,” he emphasized. Gao echoed this sentiment, indicating that embodied intelligence will likely form vertical categories driven by genuine needs, and that it is crucial to translate those needs into real-machine evaluations that provide a fair iterative environment for businesses and stakeholders. “AI often remains an experimental science; while it has certain principles and mathematics underlying it, many results still need to be tested, and testing requires feedback, which necessitates evaluations,” he highlighted.
Wang connected this evaluation system with the future of an open ecosystem, suggesting that frequent, sustainable real-world evaluations would be more effective than infrequent large-scale competitions. He believes this system should ultimately exist in a more public and open manner, providing foundational support for the entire industry through open-source frameworks, hardware, data, and evaluations.
Looking ahead to 2026, Wang expressed his hope for the establishment of unified standards across hardware, data, and model outputs, which would reduce verification and replication costs and promote ecological collaboration. “Currently, standards for hardware, data, and model outputs are extremely fragmented,” he noted.
Tang focused on a practical indicator: “One scenario, one thousand units, continuous operation.” He argued that scaling does not occur through the accumulation of scenarios but by successfully running a closed loop within a single scenario, which signifies that embodied intelligence has truly crossed the threshold of industrialization.
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/the-rising-interest-in-embodied-intelligence-industry-experts-caution-against-blind-optimism-as-chatgpt-moment-remains-distant/
