NVIDIA’s Robotics Revolution: The End of VLA and the Rise of World Action Models


VLA is dead, and teleoperation is dead! That was the bold claim Jim Fan, head of robotics at NVIDIA, made at the Sequoia AI Ascent 2026 event. In just 20 minutes, he delivered two eulogies for the robotics industry. The first mourned VLA (Vision-Language-Action models), which has dominated embodied intelligence for the past three years. The second bid farewell to teleoperation, which many believed would remain relevant for years to come.

Last year, Jim Fan spoke about how robots could be tested; this year, he has shifted the conversation to how old paradigms are dying and new ones are emerging. In Jim's view, the new paradigm closely mirrors how large language models (LLMs) are built. It begins with pre-training that simulates the next state of the world, the analogue of next-token prediction in LLMs; continues with action fine-tuning that calibrates what is valuable for real robots, the analogue of supervised fine-tuning; and finishes with reinforcement learning to close the final gap.

Recently, NVIDIA has rolled out a series of initiatives, including EgoScale, Dream Dojo, and Dream Zero, which have gone a long way toward setting the direction of embodied intelligence for 2026. In his presentation, titled Robotics: Endgame, Jim shared his latest thinking on VLA, world models, teleoperation, UMI (Universal Manipulation Interface), and scaling laws in robotics. Here are the key takeaways from his talk:

Robotics: Endgame

In the summer of 2016, in the very office we are in now, I first met Jensen Huang. He was carrying a large metal box, the world's first DGX-1, on which he wrote a message to Elon and the OpenAI team about the future of computing and humanity. I had no idea of the journey that lay ahead.

As Ilya put it best, their belief in deep learning inspired us all. It has been a journey of three steps over six years. The first step, in 2020, was GPT-3 pre-training: predicting the next token, learning the rules of grammar, and absorbing the structure of language. The second step, in 2022, was InstructGPT: supervised fine-tuning to calibrate the model for real tasks, plus reinforcement learning to surpass imitation learning. The final step, expected in 2026, is automated research that accelerates the entire cycle beyond human capabilities. As Andrej said, every effort is aimed at the ultimate goal, and LLMs are now in their endgame phase.

Honestly, I envy the language model teams; their joy is evident, they are racing toward AGI (Artificial General Intelligence), and they speak of their creations in mythic terms. Why shouldn't those of us in robotics enjoy the same excitement? As a proud scientist, I took their recipe and rebranded it as the Great Parallel. Instead of simulating strings of text, we simulate the next state of the physical world, calibrate what is valuable for real robots through action fine-tuning, and let reinforcement learning finish the job.
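To make the parallel concrete, here is a minimal sketch of the three-stage recipe on toy tensors, assuming a simple recurrent world model; every module, dimension, and loss below is an illustrative placeholder, not NVIDIA's actual architecture.

```python
# A toy rendition of the "Great Parallel": (1) pre-train a world model to
# predict the next world state, (2) fine-tune an action head on demos,
# (3) leave reinforcement learning to close the last gap.
import torch
import torch.nn as nn

world_model = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
state_head = nn.Linear(128, 64)    # predicts the next world state
action_head = nn.Linear(128, 16)   # bolted on later for action fine-tuning

# Stage 1: pre-training, the next-token analogue in the physical world.
opt = torch.optim.Adam(list(world_model.parameters()) + list(state_head.parameters()))
states = torch.randn(8, 32, 64)                 # (batch, time, state_dim) toy data
h, _ = world_model(states[:, :-1])
loss = nn.functional.mse_loss(state_head(h), states[:, 1:])
loss.backward(); opt.step()

# Stage 2: action fine-tuning, the supervised fine-tuning analogue.
opt = torch.optim.Adam(action_head.parameters())
demo_actions = torch.randn(8, 31, 16)           # toy robot demonstrations
h, _ = world_model(states[:, :-1])
loss = nn.functional.mse_loss(action_head(h.detach()), demo_actions)
loss.backward(); opt.step()

# Stage 3: reinforcement learning would refine the policy against a reward;
# a real implementation would use PPO/GRPO-style updates, omitted here.
```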

In essence, the Great Parallel is about copying the winning strategies of language models: if you can't beat them, join them. The next slide is titled Robotics: The End Game. Sorry, I couldn't resist a little humor; bananas are just too amusing, thanks to Hassabis.

How to Navigate the Endgame?

It boils down to two strategies: a model strategy and a data strategy.

First, the model strategy. For the past three years, VLA has dominated the field, with models like Pi and GR00T fitting the category. The assumption was that pre-training had been handled by vision-language models (VLMs), with an action head bolted on top. In practice, though, these are really language-vision-action models: most of the parameters are devoted to language, making it the core component, while vision and action take a back seat.

In design terms, VLA excels at encoding knowledge and nouns but falters at physicality and verbs; it is top-heavy. I particularly enjoy an example from the original VLA paper (RT-2): instructing the robot to move a Coke can onto a photo of Taylor Swift. The model had never seen her, yet it generalized. But that is not the kind of pre-training capability we are after.

What should the second pre-training paradigm be? We believed it should be something beautiful; unfortunately, it has largely devolved into AI video slop, like watching cats play banjos on surveillance cameras. It seemed like mere entertainment until we realized that these video models internally learn to simulate the next world state. Veo 3, for instance, picked up physical laws like gravity and buoyancy without any explicit encoding: by predicting the next pixels at scale, the physics emerges naturally, which opens the door to visual planning.
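As a rough illustration of the claim, here is a minimal next-frame-prediction loop on toy data; the convolutional predictor and shapes are assumptions for the sketch, since Veo 3's internals are not public.

```python
# Next-frame prediction: the only objective is pixel error on frame t+1.
import torch
import torch.nn as nn

predictor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

video = torch.rand(4, 8, 3, 64, 64)   # (batch, time, channels, H, W) toy clip
for t in range(video.shape[1] - 1):
    pred = predictor(video[:, t])              # predict frame t+1 from frame t
    loss = nn.functional.mse_loss(pred, video[:, t + 1])
    opt.zero_grad(); loss.backward(); opt.step()

# Nothing here encodes gravity or buoyancy; any "physics" the model exhibits
# must emerge from minimizing pixel prediction error at scale.
```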

How does Veo 3 handle these objects? Through forward simulation in pixel space. Notice the lower-right corner of this example: Veo 3 is smart enough to figure out that if you are not looking, geometry is optional. I call this physics slop. How do we make these world models useful? Through action fine-tuning, which compresses the superposition of all possible future states into the ones that are valuable for real robots. This is where Dream Zero comes in.

Dream Zero is a new kind of policy model that first "dreams" about what might happen in the next few seconds before acting. Robot motion control is essentially a set of high-dimensional, continuous signals, which can be treated as streams of continuously changing data; we can therefore render actions just as we render video. Dream Zero decodes two things simultaneously: the next world state and the next action to execute. This lets it perform tasks it has never seen before in a zero-shot manner. Interestingly, once the robot starts executing, we can "see" what it is thinking in real time, and the correlation is strong: if the video prediction is accurate, the action is usually correct; if the video starts hallucinating, the action usually fails. This is the first time vision and action have truly converged.
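Here is a minimal sketch of the idea behind decoding world state and action together: one shared latent feeding two heads. The class, sizes, and action dimension are illustrative assumptions, not Dream Zero's real design.

```python
# A toy world action model (WAM): decode the dreamed next frame and the next
# action from the same latent state.
import torch
import torch.nn as nn

class ToyWAM(nn.Module):
    def __init__(self, latent=256, act_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, latent), nn.ReLU()
        )
        self.frame_head = nn.Linear(latent, 3 * 64 * 64)  # "dream" the next frame
        self.action_head = nn.Linear(latent, act_dim)     # action as a continuous stream

    def forward(self, obs):
        z = self.encoder(obs)                              # one shared latent
        next_frame = self.frame_head(z).view(-1, 3, 64, 64)
        action = self.action_head(z)
        return next_frame, action                          # decoded together

wam = ToyWAM()
dreamed_frame, action = wam(torch.rand(1, 3, 64, 64))
# The diagnostic from the talk: when the dreamed frames drift into
# hallucination, the decoded actions usually fail too, because vision and
# action share the same internal state.
```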

We ran many intriguing experiments with Dream Zero, letting the robot roam the lab while we typed various instructions into a prompt box to see how it responded. Admittedly, Dream Zero does not yet achieve 100% stability across all tasks. But it resembles GPT-2: not always precise, yet it often gets the general "shape" of the action right. Dream Zero is our first step toward open-ended tasks and open-vocabulary prompting in robotics. We call this new model class World Action Models (WAM). So, let us take a moment to mourn our old friend, VLA. It served us well. Rest in peace, VLA.

VLA is gone, and WAM is here. To realize WAM, we must focus on the next generation of data strategy. The person in the image is Bill Dally, personally teleoperating a robot in our lab. Given his salary, I dare say this is the most expensive teleoperation trajectory in our entire dataset. For the past three years, robotics has been almost entirely dominated by teleoperation; what a golden era it was! Countless VR headsets, systems optimized for low-latency streaming, and contraptions resembling medieval torture devices were all built for teleoperation. The entire industry invested heavily, and it also hit hard limits.

The issue is that teleoperation has a physical ceiling on data production. In theory, a robot can collect at most 24 hours of data per day. In practice, if each robot reliably collects 3 hours of data a day, we should be grateful, provided the robot gods are in a good mood; these machines throw tantrums (that is, they break down). So how can we do better? One proposal is to put a robot hand on your own. The system is called UMI (Universal Manipulation Interface), and the idea is almost cunning: you wear the robot gripper on your hand, your hand movements become the robot's movements, and the rest of the robot's body is removed from the data collection loop. In other words, human hands gather the data robots need.

In my view, UMI may be one of the greatest papers in the field of robot data; it ultimately gave rise to two unicorn companies. On the left, the team at Generalist optimized the design further, so the gripper is worn directly on the hand. On the right, Sunday built a three-fingered data glove. Last year, we took this a step further and designed an exoskeleton system with a 1:1 mapping to a dexterous robot hand. We call it DexUMI; here is what it looks like in practice.

On the left is the traditional and fastest way to collect data: a human simply performs the task, which will always be the quickest. On the right is teleoperation, which illustrates the problem: the operator in the clip is one of our most skilled PhDs, yet even he must align and calibrate with extreme care. The whole process is slow and exhausting, and the success rate is low. In the middle is our solution: you simply put on the exoskeleton, perform the action, and the data is collected automatically. We use this data to train robot policy models, yielding a fully autonomous robot policy.

The key point is that zero teleoperation data was used during training. For the first time, we have broken the curse that each robot can collect at most 24 hours of data per day. And look how happy these robots are; they no longer have to take part in data collection themselves. But is this the end? Have we truly solved the scaling problem in robotics?

Is anyone here driving a Tesla or riding in a Waymo? When you drive, you are participating in the world's largest physical data flywheel, and remarkably, you don't even notice. When Tesla's FSD is engaged, data uploads happen silently and automatically in the background. But wearing a UMI device? Honestly, it is still too cumbersome and intrusive compared with the natural act of commuting to work. We therefore need an FSD equivalent for robots, so that data collection retreats into the background and happens seamlessly. Only then can we capture the full breadth of human dexterous manipulation, not just in laboratories but across every sector and every economically valuable form of labor.

Based on this, we are going all in on egocentric (first-person) human video with detailed hand-pose tracking and dense language annotations. We call this training paradigm EgoScale. In EgoScale, 99.9% of the training data comes from first-person human video, and the result is a truly end-to-end robot policy model that maps camera pixels to a dexterous robot hand with 22 degrees of freedom. In other words, it goes directly from "seeing" to "doing." What you are watching now is a fully autonomous robot.

In the pre-training phase, we trained EgoScale on 21,000 hours of real-world first-person human data, with no robot data at all; the model learns to predict joint positions and wrist poses. In the action fine-tuning phase, we collected only an additional 50 hours of high-precision data-glove data and 4 hours of teleoperation data, less than 0.1% of the entire training mix. Thanks to EgoScale, the model generalizes to a range of highly dexterous tasks, such as straightening playing cards, operating syringes, and pouring liquids precisely. Perhaps one day a robot nurse for the home will emerge. Interestingly, for tasks like folding shirts, the model learns a new folding strategy from a single demonstration at test time.
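The pre-training objective described above, pixels in and hand pose out, can be sketched as follows; the backbone and output sizes are placeholder assumptions, with only the 22 degrees of freedom taken from the talk.

```python
# EgoScale-style pre-training objective (sketch): map egocentric frames to
# hand joint positions and wrist pose, using only human video as supervision.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
joint_head = nn.Linear(16, 22)   # 22 degrees of freedom of the dexterous hand
wrist_head = nn.Linear(16, 6)    # wrist position (3) + orientation (3)

frames = torch.rand(8, 3, 64, 64)          # toy egocentric frames
features = backbone(frames)
joints, wrist = joint_head(features), wrist_head(features)
# Pre-train this objective on ~21,000 h of human video, then fine-tune on
# ~54 h of glove/teleoperation data to close the embodiment gap.
```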

One of the most exciting findings in this paper is that we have identified a neural scaling law for robotic dexterity, describing the relationship between pre-training duration and optimal validation loss. The relationship is remarkably beautiful, a nearly perfect log-linear curve. Six years after the language model community discovered its neural scaling laws, robotics has finally arrived at a scaling law of its own.
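A log-linear scaling law means the loss falls as a power of the data: straight-line behavior in log-log coordinates. Here is a minimal sketch of fitting such a curve; the numbers are synthetic placeholders, not the paper's measurements.

```python
# Fit loss ≈ a * hours^slope, i.e. a straight line in log-log space.
import numpy as np

hours = np.array([100.0, 500.0, 2500.0, 10000.0, 21000.0])  # synthetic
val_loss = np.array([0.90, 0.72, 0.58, 0.47, 0.42])         # synthetic

slope, intercept = np.polyfit(np.log(hours), np.log(val_loss), 1)
print(f"loss ≈ {np.exp(intercept):.2f} * hours^({slope:.3f})")
```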

If we plot these data strategies on a graph, with the X-axis measuring alignment with robot hardware and the Y-axis measuring scalability, it looks like this: teleoperation sits in the bottom-left corner, closest to the robot hardware but nearly impossible to scale. Above it sit wearable data-collection devices, which can scale to hundreds of thousands of hours. At the top sits first-person video: if we can build a robot version of the driving flywheel, its scale could easily reach millions of hours. Draw a line across this graph, and the left side represents a new paradigm in robotics: Sensorized Human Data.

So I boldly predict that in the next one to two years, the share of teleoperation will decline sharply, so much so that it may become negligible. Next, we will see a profusion of wearable data devices customized for different robot hardware and scenarios. Ultimately, the primary source of robot data will shift to first-person human video. So let us take another moment to mourn our old friend, teleoperation. It served us well. Rest in peace, teleoperation.

But is that the end of our data strategy? No. Notice that I drew two concentric circles; what about the outer one? Today, every leading laboratory is investing heavily in millions of environments for reinforcement learning, and robotics is no exception: we urgently need vast numbers of environments. Of course, you can run RL directly on real robots. In our lab, we have used RL to push certain tasks to nearly 100% success rates, with robots operating continuously for hours. Honestly, watching a robot quietly assemble GPUs is quite therapeutic. As one sage said, "Good boy." (This task has been approved by the boss.)

But here is the problem: if robots are to scale reinforcement learning to a million environments the way today's large models do, the traditional route is a dead end. Following the old approach, a million environments would require a million robots, which is unrealistic in cost, maintenance, and deployment.

So we went looking for a new path. You can take an iPhone, snap a photo of the real world, and feed it through a 3D world-scanning pipeline. The system automatically recognizes every object in the scene, extracts its three-dimensional structure, and reconstructs it inside a classic physics simulator. Crucially, these reconstructed objects are not static props; they are interactive digital entities that can be manipulated and collided with. Researchers can then endlessly augment variants of these scenes, which they call Digital Cousins. At that point, the iPhone stops being a mere smartphone and becomes a true pocket world scanner. The whole process is called Real→Sim→Real: start from the real world, enter simulation, and return to the real world. With this approach, robotics finally gains the ability to scale the physical world into the digital realm.
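The Real→Sim→Real loop can be summarized in pseudocode. Every function and field below is a hypothetical placeholder for illustration; the actual scanning pipeline and its APIs are not public.

```python
# Real -> Sim -> Real, sketched: scan a scene, spawn Digital Cousins, train
# in simulation, deploy back on the real robot.
from dataclasses import dataclass
import random

@dataclass
class SceneObject:
    name: str
    scale: float
    friction: float

def scan_scene(photo_path: str) -> list[SceneObject]:
    """Stand-in for the 3D world-scanning step (photo -> interactive objects)."""
    return [SceneObject("mug", 1.0, 0.5), SceneObject("tray", 1.0, 0.4)]

def digital_cousins(objects: list[SceneObject], n: int) -> list[list[SceneObject]]:
    """Randomize physical parameters to spawn n interactive scene variants."""
    return [
        [SceneObject(o.name, o.scale * random.uniform(0.8, 1.2),
                     o.friction * random.uniform(0.5, 1.5)) for o in objects]
        for _ in range(n)
    ]

scene = scan_scene("iphone_photo.jpg")        # Real -> Sim
variants = digital_cousins(scene, n=1000)     # scale out in simulation
# Train with RL across the variants, then deploy on hardware (Sim -> Real).
```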

Even so, that solution still rests on traditional graphics simulation. Can we push further? Enter Dream Dojo. Dream Dojo is a genuine neural simulator built on video world models. Instead of taking traditional physical parameters as input, it takes a stream of continuous action signals, and it outputs the RGB video frames the robot will see next, along with the corresponding sensor states, all generated in real time. In other words, the pixels you are seeing are not real.

Dream Dojo learns the motion mechanisms and dynamic laws of different robots in a purely data-driven way, with no physical equations or graphics engine involved. Robotics is thus entering a new post-training paradigm: a small number of real robot sites continuously collect high-value interaction data, while at the other end, massively parallel GPUs, world scans, and heavy inference compute drive the ongoing iteration of world models. In this new paradigm, a critical equation is taking shape: Compute = Environment = Data. Compute becomes the environment, the environment becomes the data, and the data in turn dictates the next round of compute investment, a self-reinforcing flywheel akin to FSD's flywheel in autonomous driving.
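The interface of a neural simulator like the one described is easy to state: actions in, generated frames and sensor readings out. The toy class below is an assumption-laden sketch, not the real system.

```python
# A toy neural simulator: learned dynamics stand in for a physics engine.
import torch
import torch.nn as nn

class ToyNeuralSim(nn.Module):
    def __init__(self, act_dim=16, latent=128):
        super().__init__()
        self.dynamics = nn.GRUCell(act_dim, latent)   # learned, no physics engine
        self.frame_head = nn.Linear(latent, 3 * 64 * 64)
        self.sensor_head = nn.Linear(latent, 32)

    def step(self, h, action):
        h = self.dynamics(action, h)
        frame = self.frame_head(h).view(-1, 3, 64, 64)  # generated pixels, not real
        sensors = self.sensor_head(h)
        return h, frame, sensors

sim = ToyNeuralSim()
h = torch.zeros(1, 128)
for _ in range(10):                      # roll out a 10-step "episode"
    h, frame, sensors = sim.step(h, torch.randn(1, 16))
```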

As our boss jokingly summarized, "The more you buy, the more you save." (That statement has also been approved by him.) Put all of this together and you realize that robotics is following an evolutionary path almost parallel to that of large models, and this is not some future scenario; it is happening right now. What we are witnessing may well be the dawn of the robotics endgame.

I have always loved Civilization, and I like to imagine my research as steadily unlocking achievements on the game's technology tree. By my count, only three achievements remain on the robotics tech tree; once they are all unlocked, I can retire. The first is the Physical Turing Test: in sufficiently rich and complex real-world tasks, humans can no longer tell, just by watching, whether the work was done by a human or a robot. There is nothing mysterious about it; it comes down to the labor value produced per unit of energy consumed. Once the same energy input yields equivalent labor value, robots will have truly passed the Turing test of the physical world. Drunk humans may be an exception, but given that today's robots still move in a stiff, slightly awkward way, we clearly have plenty of work ahead. Still, if all goes smoothly, I believe we may get there within the next two to three years.

The second achievement is the Physical API. At that point, robots will no longer exist as individual machines; they will become a truly programmable, callable, orchestrable infrastructure, much like today's software services. You might own not a single robot but an entire fleet, controlled not through buttons and controllers but via APIs, CLIs, and higher-level orchestration systems, perhaps one day managed centrally by even more powerful agents such as Opus 4.6. Once the Physical API truly arrives, many ideas that sound like science fiction today will materialize quickly. Lighthouse factories will no longer resemble assembly lines but "atomic printers": instead of CAD drawings or complex engineering files, the input might be a simple Markdown document, and the output a fully assembled physical product, produced autonomously. Likewise, automated wet labs will see robots independently running chemistry experiments, biology experiments, and even drug development, pushing the pace of scientific discovery to unprecedented heights.
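Purely as a thought experiment, a Physical API call might look like the request below; every field and name is invented for illustration, since no such service exists today.

```python
# A hypothetical Physical API task request, serialized the way software
# services are called today. All fields are invented placeholders.
import json

task_request = {
    "fleet": "warehouse-7",
    "task": "pick_and_place",
    "spec": {"object": "GPU tray", "destination": "rack B3"},
    "constraints": {"max_energy_kwh": 0.5, "deadline_s": 600},
}
print(json.dumps(task_request, indent=2))
# An orchestrating agent would submit requests like this and monitor
# execution, rather than a human driving the robot with a controller.
```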

The final and ultimate achievement on the robotics technology tree is Physical Auto Research. At that point, robots will no longer merely execute tasks assigned by humans; they will design themselves, optimize themselves, and manufacture the next generation of themselves, iterating at a pace no human engineering team can match. That may sound too much like science fiction. Will our generation really witness it? Since AlexNet's first forward pass in 2012, a model that struggled to tell cats from dogs, the AI community has taken only 14 years to reach the age of agentic AI. It is now 2026; if robotics follows a similar exponential curve, give it 14 more years. 2026 sits exactly halfway between 2012 and 2040. Technology does not progress linearly; it erupts exponentially.

Therefore, I am 95% confident that before 2040, we will indeed reach the endpoint of the robotics technology tree. And when that day arrives, we will still be young. If you believe in robots, they will ultimately respond to your belief. Our generation may have been born too late to explore the Earth and too early to explore the stars, but we were born at just the right time—to solve robotics challenges.

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/nvidias-robotics-revolution-the-end-of-vla-and-the-rise-of-world-action-models/
