
Nanyang Technological University Achieves Breakthrough: Robots Can Finally Catch Fast-Moving Objects
This research, led by the S-Lab at Nanyang Technological University (NTU) in Singapore, was published in 2025 under the identifier arXiv:2601.22153v1. Readers interested in a deeper exploration can look up the complete paper using this identifier.
In science fiction movies, we often see robots skillfully catching flying balls or securely grasping rolling objects. However, in reality, this has been a significant challenge. Just as humans need quick visual tracking, brain predictions of trajectories, and timely adjustments of their arms to catch a ball, robots also require similar “quick reflexes.” Traditional robotic systems, much like slow-reacting individuals, often miss their targets simply because they take too long to decide on their actions.
The research team at NTU has addressed this long-standing issue in robotics. They developed a new robotic control system called DynamicVLA, akin to equipping robots with a fast response mechanism similar to that of an emergency room doctor. Traditional robots operate like general practitioners, slowly checking, thinking, and then prescribing actions. In contrast, DynamicVLA allows robots to observe, think, and act simultaneously, significantly enhancing their ability to handle urgent situations.
The core innovations of this system are threefold. Firstly, the research team designed a remarkably streamlined and efficient “brain,” consisting of only 400 million parameters (a drastic reduction compared to other systems that have billions of parameters), akin to replacing a heavy truck with a high-performance sports car for faster responses. Secondly, they implemented a continuous reasoning mechanism that allows for “thinking while acting,” eliminating the need for traditional systems to wait for one action to complete before starting to think about the next one. Lastly, they developed a “time-aware” action flow mechanism that automatically discards outdated instructions, ensuring that the robot always executes the most relevant and appropriate actions.
To validate this system, the research team built a specialized dynamic object manipulation testing platform from scratch. This platform comprises 206 different objects, over 2,800 unique scenarios, and has collected 200,000 simulation cases alongside 2,000 real-world cases. It functions like a professional “robot training camp,” enabling robots to practice their skills in catching moving objects under various complex conditions.
The experimental results are promising. In tasks involving fast-moving objects, DynamicVLA achieved a success rate of 47%, which is over 300% better than the best traditional methods. More importantly, this system not only performed excellently in simulated environments but also demonstrated outstanding performance on real-world robots, proving the practical value of the technology.
The implications of this research extend far beyond academia. In future factory production lines, robots will be able to handle moving items on conveyor belts more effectively. In home settings, robotic assistants will be able to catch dropped items or help tidy up scattered toys. In medical scenarios, robots will be able to assist doctors more precisely during delicate procedures that require real-time adjustments.
1. The “Slow Reflex” Problem in Robots Finally Solved
For a long time, robots have operated like individuals who are always a step behind when faced with moving objects. When you throw a ball to a traditional robot, it first needs to use its camera to see the ball’s position, then it spends time contemplating how to move its arm, and only after this process does it begin to act. This entire sequence resembles how a person must stop to observe and think carefully before reacting. By the time the robot extends its hand, the ball has already fallen.
The root of this problem lies in the “serial processing” model used by traditional robotic systems, functioning like a person who can only do one thing at a time. It must complete observation, then thinking, and finally action, in that order, without any overlap. Worse, while executing a series of preset actions, if the environment changes, the robot cannot adapt in time and must mechanically complete the pre-programmed sequence.
The research team discovered that this delay issue is magnified in dynamic environments. In static environments, like grabbing a stationary cup on a table, a few seconds of delay might not be critical, as the cup remains in place. However, in dynamic settings, even a 0.5-second delay can result in completely missing the target. This is similar to driving; if your reaction time is too long, you can never drive safely in complex conditions.
While traditional visual-language-action models excel at understanding complex instructions, they typically have billions of parameters, requiring powerful computational resources and extended processing times. These models are like knowledgeable scholars who think slowly; although they provide well-considered answers, they struggle in situations that demand quick responses.
The research team at NTU recognized that addressing this problem required fundamentally redesigning the robot’s “thinking process.” They needed to enable robots to learn how to think while observing and act while thinking, achieving true “quick reflexes.”
2. A Triad of Innovative Solutions
To tackle the core issues of robotic slow reactions, the research team introduced a comprehensive three-part solution, with each segment addressing specific technical bottlenecks. The first innovation was to design a lightweight yet efficient “robot brain.” Traditional robotic control systems are like using a large computer to control a smartphone; while powerful, they are slow to respond. The research team compressed the parameter count to 400 million, effectively providing robots with a compact yet high-performance “dedicated processor.” This system employs convolutional neural networks to process visual information, efficiently extracting and compressing spatial data like the human eye’s retina.
Secondly, they achieved a “continuous reasoning” mechanism. Traditional systems resemble a bank counter that must wait for one customer to leave completely before serving the next. In contrast, the new system operates like a modern fast-food restaurant assembly line. While a robot is executing a current action, its “brain” is already analyzing new environmental information and contemplating the next move. This overlapping work mode eliminates waiting time between actions, allowing the robot to continuously adapt to environmental changes.
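The overlap described above can be sketched as a small producer-consumer pattern in Python. This is purely illustrative and not the paper’s implementation; the function names, latencies, and queue-based design are assumptions.

```python
import threading
import queue
import time

def infer_next_actions(observation):
    """Stand-in for the policy's forward pass (assumed ~tens of ms)."""
    time.sleep(0.03)  # simulated inference latency
    return [observation + i * 0.1 for i in range(4)]  # a short chunk of actions

def actor(action_queue, steps, executed):
    """Executes whatever actions are available while inference keeps running."""
    for _ in range(steps):
        try:
            executed.append(action_queue.get(timeout=0.5))
        except queue.Empty:
            break

# The planner keeps the queue topped up while the actor drains it,
# so acting and reasoning overlap instead of strictly alternating.
actions = queue.Queue()
executed = []
actor_thread = threading.Thread(target=actor, args=(actions, 8, executed))
actor_thread.start()
for step in range(2):                      # two inference rounds
    for a in infer_next_actions(float(step)):
        actions.put(a)
actor_thread.join()
print(len(executed))  # 8: every action ran while planning continued
```

The key design point is that the actor never idles waiting for a full "observe, think, act" cycle to finish; it consumes actions as soon as the planner produces them.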
The third key innovation is the “time-aware action flow” mechanism. This mechanism acts like an intelligent traffic control system, identifying which commands are outdated and which are the most current and effective. When the environment changes rapidly, robots will automatically discard outdated action commands generated based on previous information, prioritizing instructions based on the most recent environmental status. This ensures that each robot action is highly aligned with current conditions, avoiding ineffective actions.
These three innovations complement each other, forming an intelligent system capable of real-time responses in dynamic environments. The lightweight architecture ensures high-speed processing, continuous reasoning eliminates waiting times, and the time-awareness mechanism guarantees the timeliness of actions.
3. A Specialized Training Ground for Robots to Hone Their Skills
For robots to master the skill of catching moving objects, a specialized “training ground” is necessary. The research team found that existing robotic datasets were like learning to drive only in a parking lot: poor preparation for the complexities of real roads. Therefore, they built a comprehensive benchmark testing platform from scratch, named DOM (Dynamic Object Manipulation).
This training platform was designed similarly to a comprehensive driving school. Firstly, they prepared 206 different objects, ranging from fruits and vegetables to everyday containers, covering various shapes, weights, and materials. These objects would move at different speeds, simulating a road with bicycles moving slowly and fast cars zooming by. The friction coefficients of the objects would also vary, mimicking different environments from smooth surfaces to rough carpets.
To enhance training diversity, the research team created over 2,800 distinct 3D scenarios. These scenarios are akin to different testing environments, featuring bright, dim, simple, and complex settings. Each scenario is equipped with multiple camera angles, including close-range cameras on the robot’s wrist and distant panoramic cameras, ensuring that the robot can observe and understand the environment from multiple perspectives.
Most interestingly, the research team also developed a fully automated data collection system. Traditional robotic training requires manual remote control, which is impractical in fast-moving object scenarios as human reaction times cannot keep up. Thus, they designed a smart controller based on a state machine, akin to providing robots with an “autonomous driving instructor.” This system can track the 6D position and velocity information of objects in real-time, then drive the robot to complete the full action sequence of approaching, grabbing, moving, and placing.
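The approach-grab-move-place sequence such a scripted controller runs can be sketched as a minimal state machine. The phase names, distance threshold, and velocity-lead term below are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    position: tuple   # estimated object position (x, y, z)
    velocity: tuple   # estimated object velocity

class PickController:
    """Toy state machine: APPROACH -> GRASP -> MOVE -> PLACE -> DONE."""

    def __init__(self, place_target):
        self.phase = "APPROACH"
        self.place_target = place_target

    def step(self, obj, gripper_pos, lead=0.2):
        """One control tick: read tracked state, emit a gripper target."""
        if self.phase == "APPROACH":
            # Aim slightly ahead of the moving object (velocity lead term).
            target = tuple(p + lead * v for p, v in zip(obj.position, obj.velocity))
            if all(abs(g - t) < 0.02 for g, t in zip(gripper_pos, target)):
                self.phase = "GRASP"
            return target
        if self.phase == "GRASP":
            self.phase = "MOVE"
            return obj.position          # close the gripper at the object
        if self.phase == "MOVE":
            self.phase = "PLACE"
            return self.place_target     # carry toward the placement zone
        self.phase = "DONE"
        return self.place_target

ctrl = PickController(place_target=(0.5, 0.5, 0.1))
obj = TrackedObject(position=(0.0, 0.0, 0.0), velocity=(0.1, 0.0, 0.0))
target = ctrl.step(obj, gripper_pos=(1.0, 1.0, 1.0))
print(ctrl.phase)  # APPROACH: gripper is still far from the lead target
```

Because each tick re-reads the tracker, the controller naturally follows a moving object without any human in the loop, which is what makes fully automated data collection possible at speeds human teleoperators cannot match.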
In the simulated environment, this system generated 200,000 training cases, encompassing various potential object movement patterns and environmental conditions.

Yet simulation is not reality; hence, the research team built a “real-world simulator.” They utilized high-speed cameras and advanced 3D tracking technologies to estimate the position and movement state of real objects in real-time, then employed the same intelligent controller to drive actual robots for training. This approach collected 2,000 real-world training cases, ensuring that robots could adapt to the various uncertainties of real environments. The entire training process resembles nurturing a versatile athlete, who not only practices basic skills in a standardized training ground but also accumulates practical experience in diverse real-world competitions.
4. Comprehensive Capability Testing to Prove Competence
To evaluate the true capabilities of the DynamicVLA system, the research team designed a comprehensive testing framework, akin to setting various exams for robots from beginner to advanced levels. This testing is divided into three main dimensions, each containing multiple specific challenges.
In the interactive ability test, the research team set up three different difficulty scenarios. The closed-loop response test is like assessing a driver’s reaction capabilities at different speeds, where the robot must respond to objects moving at varying speeds, ranging from stationary to 0.75 meters per second. The dynamic adaptation test is more challenging, akin to requiring the robot to deal with vehicles suddenly changing lanes, where objects might abruptly alter direction or speed, necessitating immediate strategic adjustments from the robot. The long-sequence coordination test assesses the robot’s “endurance” and “focus,” requiring it to handle multiple moving objects continuously, much like juggling several balls simultaneously.
The perception ability test evaluates the robot’s “vision” and “understanding.” In the visual comprehension test, the robot must accurately identify a target among multiple similar objects, much like finding the correct key among a bunch of similar ones. The spatial reasoning test requires the robot to understand relative positioning, such as placing a ball in the “left box” or “right tape area.” The motion perception test is particularly interesting, where the robot needs to identify targets based on object motion characteristics, such as grabbing the “slow-moving ball” or the “fast-rolling can.”

The generalization ability test examines how well the robot adapts to unknown situations. The visual generalization test uses objects that were never seen during training, similar to asking a child who has only seen apples to recognize a pear. The motion generalization test introduces irregular movement patterns, where asymmetrical objects like potatoes may produce unexpected rolling trajectories. Lastly, the interference robustness test adds various “noise” to the environment, such as random collisions or pushes, testing the robot’s performance under imperfect conditions.
The test results are encouraging. Overall, DynamicVLA achieved a success rate of 47.06%, a significant leap compared to the best traditional method’s 13.61%. Particularly in the closed-loop response test, the success rate reached 60.5%, surpassing the second-best by 188%. Even in the most challenging long-sequence coordination tasks, the success rate hit 40.5%, far exceeding traditional methods’ less than 8%. Importantly, these outstanding performances were verified not only in simulated environments but also in real-world tests. Experiments conducted using the Franka robotic arm and AgileX PiPER robot demonstrated that DynamicVLA significantly outperformed traditional methods across various practical tasks, confirming the technology’s practical value.
5. Intricate Design Details of the Technology
The success of the DynamicVLA system arises not only from significant architectural innovations but also from countless meticulously crafted technical details. This is akin to a precision watch, where every gear and spring must fit perfectly to ensure accurate timekeeping.

In terms of visual processing, the system employs FastViT as the visual encoder, which is a convolutional network optimized for speed. Unlike traditional transformer-based visual processors, FastViT is like using a professional camera instead of a smartphone for photography; while it may not have the most comprehensive features, it excels in specific tasks. It can rapidly compress high-resolution images into 36 key visual features, much like simplifying a detailed map into a few important landmarks, retaining critical information while significantly enhancing processing speed.
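As a rough illustration of what compressing an image into 36 visual features means, here is a toy average-pooling sketch. The input size and pooling scheme are assumptions for demonstration; FastViT’s actual tokenization is learned, not a fixed pooling.

```python
# Compress an H x W feature map into a fixed 6 x 6 = 36 grid of "tokens"
# by average pooling. The 36-token count follows the article; everything
# else here is an illustrative assumption.
def pool_to_tokens(feature_map, grid=6):
    h, w = len(feature_map), len(feature_map[0])
    bh, bw = h // grid, w // grid      # block size per token
    tokens = []
    for i in range(grid):
        for j in range(grid):
            block = [feature_map[r][c]
                     for r in range(i * bh, (i + 1) * bh)
                     for c in range(j * bw, (j + 1) * bw)]
            tokens.append(sum(block) / len(block))  # mean of the block
    return tokens

fmap = [[float(r) for _ in range(24)] for r in range(24)]  # 24x24 toy map
tokens = pool_to_tokens(fmap)
print(len(tokens))  # 36
```

However the compression is implemented, the payoff is the same: downstream reasoning operates on 36 features instead of thousands of pixels, which is where the speed comes from.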
The language comprehension component utilizes the SmolLM2-360M model but only retains the first 16 layers. This “truncation” strategy resembles using a rapid diagnosis rather than a thorough medical examination, ensuring accurate understanding while significantly improving response speed. This streamlined language model can comprehend complex instructions like “grab the rolling orange and place it on the white tray,” translating them into executable action sequences for the robot.
The action generation section employs diffusion model technology, which may sound complex but is similar to a step-by-step refinement process in painting. The system first creates a “rough action sketch,” then refines it through multiple iterations, ultimately producing precise action directives. This method yields more natural and fluid robot movements, avoiding the rigidity and incoherence often seen in traditional methods.
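The “rough sketch, then refine” idea can be shown with a toy loop. This is purely illustrative: in a real diffusion policy, a trained denoising network predicts the correction that this stand-in computes directly from a known target.

```python
import random

# Toy sketch of iterative action refinement: start from random noise
# (the "rough action sketch") and repeatedly nudge it toward the target.
def refine_action(target, steps=10, seed=0):
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]  # random initial sketch
    for k in range(steps):
        blend = (k + 1) / steps                     # growing confidence
        action = [a + blend * (t - a) for a, t in zip(action, target)]
    return action

target = [0.3, -0.1, 0.5]       # hypothetical 3-DoF action command
refined = refine_action(target)
print(all(abs(a - t) < 1e-6 for a, t in zip(refined, target)))  # True
```

The gradual schedule is what yields the "natural and fluid" motions the article describes: early steps make large coarse corrections, late steps make small precise ones.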
In terms of time synchronization, the system implements an intricate “clock mechanism.” Each action directive is time-stamped, and the system continually monitors the current time and the timeliness of directives. If it detects that a directive has become “stale” (based on outdated environmental information), it promptly discards it and executes the latest information-based instructions. This is akin to a GPS navigation system that immediately replans the best route when you veer off course.
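A minimal sketch of such a staleness check might look like the following. The buffer class, threshold, and timestamps are illustrative assumptions, not the system’s actual interface.

```python
from collections import deque

class TimedActionBuffer:
    """Each queued action carries the timestamp of the observation it was
    computed from; actions older than max_age seconds are discarded."""

    def __init__(self, max_age=0.1):
        self.max_age = max_age
        self.buffer = deque()

    def push(self, action, obs_time):
        self.buffer.append((obs_time, action))

    def pop_fresh(self, now):
        """Return the next action whose source observation is still recent."""
        while self.buffer:
            obs_time, action = self.buffer.popleft()
            if now - obs_time <= self.max_age:
                return action   # fresh enough to execute
        return None             # everything was stale; wait for new inference

buf = TimedActionBuffer(max_age=0.1)
buf.push("old_grasp", obs_time=0.00)   # based on a 0.25s-old observation
buf.push("new_grasp", obs_time=0.20)   # based on a 0.05s-old observation
result = buf.pop_fresh(now=0.25)
print(result)  # new_grasp: the stale command was silently dropped
```

Returning `None` when everything is stale is a deliberate choice in this sketch: executing no action is safer than executing one computed for a world state that no longer exists.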
The training process was also meticulously designed. The system first underwent pre-training on a large dataset of image-text pairs to learn basic visual-language correspondences, akin to learning to describe images. Then, it was trained specifically on dynamic manipulation datasets to learn how to convert language instructions into action sequences. Finally, it was fine-tuned on real robots to adapt to the characteristics of the specific hardware platform.

The entire system’s memory usage is only 1.8GB, and it runs at 88Hz on an NVIDIA RTX A6000 graphics card, meaning it can complete a full perception-reasoning-action cycle 88 times per second. This high-frequency processing capability is key to achieving real-time dynamic manipulation.
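A quick back-of-the-envelope calculation shows what the reported 88Hz rate implies per cycle, and how far the benchmark’s fastest objects (0.75 meters per second) travel within that window:

```python
# Each full perception-reasoning-action cycle must fit in 1/88 of a second.
control_rate_hz = 88
cycle_budget_ms = 1000 / control_rate_hz
print(round(cycle_budget_ms, 1))  # 11.4 ms per cycle

# At the benchmark's top speed, how far does an object move between cycles?
object_speed_m_s = 0.75
travel_per_cycle_mm = object_speed_m_s * cycle_budget_ms  # m/s * ms -> mm
print(round(travel_per_cycle_mm, 1))  # 8.5 mm of drift per cycle
```

In other words, even the fastest objects in the benchmark move well under a centimeter between control updates, which is what keeps the manipulation errors correctable.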
6. Insights Revealed by Experimental Results
Through extensive experiments, the research team not only validated the superior performance of DynamicVLA but also discovered some interesting patterns and insights that hold significant value for the entire field of robotics. Firstly, the research found that time delay is the most critical factor in dynamic manipulation. Even a few tens of milliseconds of additional delay can significantly affect success rates. This is similar to driving; even a 0.1-second reaction delay at high speeds can lead to serious consequences. Experimental data shows that when reasoning time increases from 0.2 seconds to 0.4 seconds, the success rate in tasks involving fast-moving objects drops by over 30%. This finding emphasizes the necessity of designing specialized optimized systems for dynamic tasks.
Secondly, the research team found that the combination of continuous reasoning and time-aware mechanisms produces a synergistic effect. Using continuous reasoning alone can improve success rates by about 8%, while using the time-aware mechanism alone can lead to a 6% improvement. However, when both are utilized together, performance enhancement reaches 17%, exceeding the simple sum of their effects. This is akin to music; individual instruments sound nice, but when played together, they create a more harmonious experience.
In the analysis of performance across different types of tasks, the research team observed an interesting phenomenon. Robots performed better when handling regular-shaped objects (like balls or cans) but experienced a drop in success rates when dealing with irregular objects (like potatoes or bananas). This is because the motion trajectories of irregular objects are harder to predict, similar to predicting the bounce direction of a deformable ball compared to a standard basketball.
This discovery points to future improvements; the system needs to better understand how physical properties influence motion. Regarding the trade-off between model size and performance, experimental results revealed an interesting “sweet spot.” When the parameter count is too low (like 135 million), the system lacks understanding capacity and cannot accurately parse complex instructions. Conversely, when the parameter count is too high (like 1.7 billion), reasoning speeds become too slow, causing missed action opportunities. The configuration with 400 million parameters strikes the optimal balance between comprehension and responsiveness, akin to finding the perfect midpoint between a sports car and a truck.
In real-world experiments, the research team also discovered some patterns related to the transfer of simulations to reality. Vision-related capabilities transferred relatively well, as modern simulation engines have become highly realistic in visual rendering. However, capabilities related to physical interactions experienced a degree of performance decline, primarily because physical phenomena like friction and collisions in the real world are more complex and variable than in simulations.
Through ablation studies, the research team also validated the contribution of each component. Removing the FastViT visual encoder resulted in an 18% performance drop, demonstrating the importance of efficient visual processing. Removing the continuous reasoning mechanism led to a 7% decline, while the removal of the time-aware mechanism caused an 8% drop, confirming that each innovative component is indispensable.

These in-depth analyses not only validate the rationale behind the design of DynamicVLA but also provide direction for future research. They indicate that dynamic manipulation is not merely an engineering problem but a complex systems challenge that requires careful balancing across multiple dimensions.
7. Broad Application Prospects and Future Outlook
The success of DynamicVLA technology opens a new chapter in robotic applications. This technology acts like equipping robots with a “responsive nervous system,” enabling them for the first time to cope with rapidly changing environments, which will have far-reaching impacts across various fields.

In manufacturing, this technology will revolutionize production line design concepts. Traditional production lines require precise positioning devices and complex conveyor systems to ensure that objects remain in predetermined positions; these devices are costly and lack flexibility. With DynamicVLA technology, robotic workers will be able to handle moving items on conveyor belts directly, even adapting to changes in conveyor speed or item misplacement. This is akin to replacing automated systems reliant on precision equipment with trained human workers, enhancing both efficiency and adaptability.
In logistics and warehousing, robots will be better equipped to handle sorting tasks. Current automated sorting systems generally require items to move strictly along predetermined paths; however, packages often slide, roll, or deviate from their tracks in reality. Robots equipped with DynamicVLA will proactively track these “unruly” packages, significantly improving sorting accuracy and efficiency, particularly valuable in e-commerce logistics that deal with numerous irregular packages.
In the field of service robots, this technology will make household robots much more practical. Current household robots primarily manage simple static tasks, such as vacuuming or transporting fixed-position items. With dynamic manipulation capabilities, they will be able to catch your dropped phone, tidy up fallen toys, or even assist you in catching ingredients slipping off the cutting board while you cook. These seemingly simple tasks actually require highly complex real-time coordination abilities.
The prospects for applications in the medical field are equally exciting. Surgical robots will be able to better adapt to real-time changes during procedures, such as automatically adjusting their operational trajectory when slight movements occur due to the patient’s breathing or heartbeat. In rehabilitation therapy, robotic therapists can respond to patients’ movements in real time, providing more natural and effective assistive training.
However, the research team also candidly acknowledges the current limitations of the technology. Firstly, the system is currently optimized primarily for rigid objects, with the handling of liquids, powders, or soft materials still requiring enhancement. Secondly, in extremely dynamic environments, such as when objects move at speeds exceeding 1 meter per second, the success rates visibly decline. Additionally, although the system’s generalization capabilities are robust, it may still struggle when faced with completely unseen object types.
Future research directions are thus becoming clearer. The research team plans to further optimize the system architecture and explore more efficient visual-language integration methods to enhance understanding capabilities while maintaining speed. They also aim to expand into more complex physical scenarios, including multi-object interactions and flexible material handling. The long-term goal is to achieve truly universal dynamic manipulation capabilities, allowing robots to handle tasks in various dynamic environments as naturally as humans do.
The societal impact of this technology also merits attention. As robots become more flexible and reliable, they will be able to take on more tasks currently performed only by humans. This not only brings positive effects such as increased productivity and reduced human labor but also raises considerations about changes in employment structures. However, historical experience indicates that technological advancements often create new job opportunities, and human society will eventually find a balance coexisting with more intelligent robots.
Ultimately, DynamicVLA represents not just a technical breakthrough but a significant milestone in the journey of robots transitioning from laboratories to the real world. When robots can finally respond to dynamic environments as flexibly as humans, we move a step closer to the era of intelligent robots depicted in science fiction. This research from Nanyang Technological University may be recorded as a turning point in the history of robotics, marking the moment when robots truly began to acquire the capability to work independently in complex real-world scenarios.
Q&A
Q1: What is DynamicVLA?
A: DynamicVLA is a robotic control system developed by Nanyang Technological University specifically designed to solve the problem of robots catching fast-moving objects. It equips robots with a quick response mechanism similar to that of an emergency room doctor, enabling them to observe, think, and act simultaneously, achieving over 300% improvement in success rates compared to traditional methods.
Q2: How does this system differ from ordinary robots?
A: Ordinary robots react slowly, requiring them to first see and understand before acting, often missing their targets in the process. DynamicVLA, on the other hand, can simultaneously perform observation, reasoning, and action, automatically discarding outdated commands to execute the most suitable actions at all times.
Q3: When will DynamicVLA technology be applied in daily life?
A: The technology has been successfully validated in laboratory settings and may first be applied in factory production lines and warehousing logistics in the coming years. The application in household robots will need further optimization and cost reduction, but this technology signifies that robots are beginning to acquire the capability to work in complex real-world environments.
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/nanyang-technological-university-achieves-breakthrough-robots-can-now-capture-fast-moving-objects/
