Westlake University’s Novel Approach to Robot Decision-Making: Integrating Historical, Current, and Future Insights in Action Models


Imagine a robot reaching for a cup on a table. It lifts the cup, then suddenly stops, places it back, and reaches for it again. This repetitive action, as if it forgot what it just did, is not uncommon in real-world scenarios. For instance, a button might be pressed repeatedly, or a drawer might be pushed even after it’s closed. These failures are not due to a lack of vision but stem from an absence of a “world model” capable of simulating the evolution of time and space.

Current visual language action models can comprehend images and instructions, yet they depend solely on immediate observations for decision-making in continuous tasks. When tasks require a series of steps—like picking up an object, moving it, placing it down, and then closing a device—repetitive actions and decision disruptions occur. The core issue lies in the model’s inability to understand time. This challenge is becoming a critical bottleneck in the development of embodied intelligence.

Most existing methods rely on a “see and act” mechanism that performs well in short tasks but struggles with longer sequences, leading to incoherent actions and decision drift. The challenge now is to create models that not only perceive the current state but also remember the past and anticipate the future. In this context, the team led by Wang Donglin from Westlake University introduced the paper titled “HiF-VLA: Hindsight, Insight and Foresight for Vision-Language-Action Models.” This research presents HiF-VLA, which moves beyond simply relying on historical images or predictions of future scenes. It uses “motion” as the core expression of temporal information, enabling the model to simultaneously model past changes, current states, and future trends, leading to more stable continuous decision-making.

The significance of this research lies not just in performance improvement but in proposing a new paradigm that shifts robots from “passive response” to “thinking while acting.” As embodied intelligence progresses towards real-world applications, the ability to understand time is becoming a key factor in determining whether systems are truly usable.

The research tested HiF-VLA on the LIBERO-Long benchmark, which evaluates whether robots can complete multiple actions continuously, such as picking up objects, placing them, and closing devices. HiF-VLA achieved a success rate of 94.4% under single-view conditions and 96.4% under multi-view conditions. In comparison, the current strong baseline, OpenVLA-OFT, reached 91.0% in single-view and 94.0% in multi-view settings, so HiF-VLA improves by 3.4 percentage points in single-view and 2.4 percentage points in multi-view scenarios. Notably, across the ten constituent tasks, several reached a success rate of 100%, with the lowest at 76%, indicating stable overall performance rather than reliance on a few individual tasks.

Another critical observation is that HiF-VLA’s single-view performance approaches, and in some cases matches, other methods’ multi-view performance. This suggests the improvement stems primarily from its temporal modeling capability rather than from additional visual information or extra cameras.

In the CALVIN cross-environment generalization task, models were trained in environments A, B, and C, then tested in an unseen environment D. The evaluation metric was the number of tasks successfully completed without interruption. The results showed that HiF-VLA achieved 4.08 in single-view and 4.35 in multi-view, while OpenVLA-OFT scored around 4.10, Seer around 4.28, and RoboVLMs around 4.25. The highest score of 4.35 under multi-view conditions indicates a significant improvement of approximately 0.25 tasks compared to the baseline. This increase is crucial since any failure at any step renders subsequent tasks uncounted, meaning a higher number reflects stronger stability in long-term continuous decision-making and better long-term planning capabilities.

Regarding efficiency and computational cost, the study examined whether the performance gains come at the expense of extra computation. Introducing image-based future sub-goal prediction yielded a success rate of 91.8% but raised latency to 115.9 milliseconds, 1.59 times the baseline. Stacking historical frames fared worse: the success rate dropped to 90.4% while latency climbed to 229.5 milliseconds, 3.15 times the baseline. Excessive image information thus not only incurs high computational cost but also interferes with the model’s judgment. In contrast, HiF-VLA reached 92.2% at 82.7 milliseconds using future reasoning alone, and the same 92.2% at 117.7 milliseconds using historical information alone; combining both pushed the success rate to 93.2% at 121.6 milliseconds. Overall, the method improves success rates while remaining far cheaper than historical frame stacking, demonstrating that motion information is more efficient than relying on raw historical images.

In terms of temporal length extension capabilities, the study gradually increased the historical length from 4 to 8, then to 16 and 32. Results indicated that the optimal performance was at a length of 8, with 94.4% success in single-view and 96.4% in multi-view. Further increases in length resulted in performance declines, likely due to information overload causing redundancy. In terms of latency, traditional methods exhibit a linear increase in computational cost with historical length. When the length reaches 8, latency increases by approximately 4.5 times, while HiF-VLA maintains stable latency with only slight growth, indicating better scalability in temporal dimensions.

In real robot experiments, multiple long-sequence tasks were set up to validate practical effectiveness. In a sequential button-pressing task, the baseline method achieved a success rate of 17.4%, which HiF-VLA nearly doubled to 34.2%. In a covering and stacking task, the baseline was 33.3% and HiF-VLA reached 57.9%, an increase of 24.6 percentage points. For a placing task, the baseline was approximately 62.5% while HiF-VLA achieved around 65%, a smaller but consistent improvement. Researchers noted that the baseline struggled to determine whether a button had been pressed because the state change is visually subtle, whereas HiF-VLA used temporal change information to identify the state transition, leading to better performance on complex tasks. This further underscores that incorporating temporal information significantly enhances robots’ decision-making in long-sequence tasks.

The systematic comparison of time modeling methods involved careful arrangement of data and task designs. In simulated environments, ten long-sequence tasks from the LIBERO dataset and cross-environment generalization tasks from the CALVIN dataset were employed. In real robot experiments, each task collected 100 demonstration data points, executing each task 20 times during the testing phase to evaluate the model’s stability and generalization capabilities. For input information design, the model received three types of information: the current visual frame for current state perception, historical motion representing past dynamic changes, and language instructions providing task goals. This setup enables the model to make joint decisions across both temporal and semantic dimensions.
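The three-stream input design above can be sketched in code. This is an illustrative reconstruction, not the paper’s actual implementation: the function names and the use of simple frame differences as the motion representation are assumptions made for clarity.

```python
import numpy as np

def compute_motion_history(frames):
    """Summarize past dynamics as consecutive-frame differences.

    Hypothetical stand-in for the paper's motion representation:
    differences keep only what changed between steps, discarding the
    static background that a raw frame stack would repeat.
    """
    stack = np.stack(frames).astype(np.float32)
    return stack[1:] - stack[:-1]  # shape: (history - 1, H, W, C)

def build_model_inputs(current_frame, frame_history, instruction):
    """Assemble the three input streams described above."""
    return {
        "current_frame": current_frame,                           # current state perception
        "motion_history": compute_motion_history(frame_history),  # past dynamic changes
        "instruction": instruction,                               # task goal in language
    }

# Example with dummy data: 8 past frames of 64x64 RGB
history = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(8)]
inputs = build_model_inputs(history[-1], history, "close the drawer")
print(inputs["motion_history"].shape)  # (7, 64, 64, 3)
```

The motion stream is one step shorter than the frame history, since each entry encodes a transition rather than a snapshot.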

In the comparative experimental design, the research team established various methods for systematic comparison. The first method relied solely on current observational information for decision-making, excluding any temporal information. The second method introduced time information by stacking historical images, although this approach suffers from significant redundancy and high computational costs. The third method guided decision-making by predicting future images as sub-goals, which can lead to errors and instability. In contrast, the proposed method employs motion information instead of images to represent temporal changes, thus reducing redundant information and enhancing modeling efficiency.
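The redundancy argument can be made concrete with a small numpy experiment. This is not the paper’s method, only an illustration: in a synthetic scene where one small object moves across a static background, a tiny fraction of the pixels in a stacked history actually changes between frames, so a motion-style representation carries far less redundant content.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
background = rng.integers(0, 255, (H, W), dtype=np.uint8)

# 8 historical frames: a small bright patch slides right over a static scene
frames = []
for t in range(8):
    frame = background.copy()
    frame[10:14, t:t + 4] = 255
    frames.append(frame)

stacked = np.stack(frames).astype(np.int16)  # "stack historical images" baseline
diffs = stacked[1:] - stacked[:-1]           # motion-style representation

changed = np.count_nonzero(diffs) / diffs.size
print(f"fraction of history pixels that changed: {changed:.4f}")  # well under 1%
```

Everything except that changed fraction is static content the frame-stacking baseline re-encodes at every step, which matches the article’s observation that stacking raises latency roughly threefold without improving accuracy.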

In ablation experiments, the study further analyzed the impact of different design choices on performance. Initial experiments with historical length indicated that the optimal length is 8; shorter lengths fail to provide sufficient information while longer lengths introduce redundancy and affect model judgment. Additionally, in terms of how historical information is utilized, researchers compared two strategies: directly inputting historical information into the visual language model, achieving a success rate of 92.8%, versus injecting historical information into the decision module, which raised the success rate to 94.4%. This result suggests that directly incorporating historical information into the visual language model disrupts its inherent visual and language comprehension processes, while integrating historical information during the decision stage enhances its effectiveness.
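The two fusion strategies in that ablation can be sketched structurally. The modules below are random stand-ins, not the paper’s architecture; the point is only where the historical motion feature enters the pipeline: appended to the backbone’s input (strategy A) versus concatenated into the decision module’s input (strategy B).

```python
import numpy as np

rng = np.random.default_rng(1)

def backbone(tokens):
    """Stand-in for the vision-language backbone: random projection + pooling."""
    W = rng.standard_normal((tokens.shape[-1], 32))
    return np.tanh(tokens @ W).mean(axis=0)  # pooled 32-dim feature

def action_head(features, out_dim=7):
    """Stand-in for the decision module: linear map to a 7-DoF action."""
    W = rng.standard_normal((features.shape[-1], out_dim))
    return features @ W

obs_tokens = rng.standard_normal((16, 64))  # current frame + instruction tokens
motion_feat = rng.standard_normal(64)       # summarized historical motion

# Strategy A (92.8% in the ablation): history enters the backbone with the observations
action_a = action_head(backbone(np.vstack([obs_tokens, motion_feat[None, :]])))

# Strategy B (94.4%): backbone sees only observations; history joins at the decision stage
action_b = action_head(np.concatenate([backbone(obs_tokens), motion_feat]))

print(action_a.shape, action_b.shape)  # (7,) (7,)
```

Strategy B leaves the backbone’s visual and language processing untouched, which is consistent with the article’s explanation of why late injection performs better.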

This research addresses a core problem: traditional models often rely solely on current observations during decision-making, neglecting temporal information, which results in incoherent actions and increased failure rates in long-sequence tasks. Researchers assert that the fundamental issue is not a lack of visual capabilities but rather an absence of time modeling abilities. This understanding led to an important finding: motion information is more suitable for representing temporal changes than images. This is because images contain a significant amount of static information, whereas motion information retains only the genuinely changed aspects, making it more efficient and expressive. This discovery directly impacts robotics research, transforming the previously linear process from perception to action into a decision-making process that simultaneously considers the past, present, and future—shifting from simple perception to action driven by past states, current conditions, and future predictions.

In terms of engineering value, experimental results indicate that this method not only shows significant performance improvements, such as a maximum success rate of 96.4%, but also offers computational efficiency, avoiding the potential threefold computational costs seen in traditional methods. Moreover, this approach demonstrates stronger generalization capabilities across different environments and proves effective in real robot experiments, highlighting its substantial practical application potential.

Furthermore, this research promotes a new intelligent paradigm, transitioning from “see and act” visual language action models to “think while acting” world action models. HiF-VLA not only changes the structural design of models but also redefines the capabilities robots should possess. Past systems functioned more like passive responders, reacting instantaneously to current inputs; in this new paradigm, robots begin to develop continuous decision-making capabilities, remembering recent actions, assessing their current stage, and predicting subsequent steps. This change signifies that robots are no longer limited to executing single-step actions but can comprehend an entire process, continually adjusting their behavior throughout. It also implies that the development of embodied intelligence is evolving from a “perception-driven reactive system” to a “time-driven reasoning system.” When models truly acquire this capability, robots can operate stably in complex, dynamic real-world environments rather than merely completing pre-defined tasks in controlled scenarios.

The paper’s corresponding author, Wang Donglin, is currently the Deputy Director of the Department of Artificial Intelligence at Westlake University. He is the founder and head of the Machine Intelligence Laboratory (MiLAB) and the founder of Westlake Robotics Technology (Hangzhou) Co., Ltd. Wang received his bachelor’s and master’s degrees in Electronic Information Engineering from Xi’an Jiaotong University before earning his PhD in Electronic and Computer Engineering from the University of Calgary in Canada, where he also conducted postdoctoral research. He later taught at New York Institute of Technology, rising to the rank of Associate Professor, before returning to China in 2017 to join Westlake University as one of its first full-time faculty members and establishing MiLAB.

Wang also serves as the Chief Scientist for the National Major Project for Scientific and Technological Innovation 2030 and has been selected for the National High-Level Talent Program, playing a significant role in national-level research projects. His research primarily focuses on robotic learning and intelligent decision-making, emphasizing reinforcement learning, meta-learning, and robotic behavioral intelligence, with the goal of enabling robots to autonomously learn, adapt to new environments quickly, and accomplish complex tasks. His work emphasizes the complete feedback loop from perception to decision-making to action, particularly stable execution in long-sequence tasks and real-world environments.

He has published over one hundred papers and is active in cutting-edge fields like robotic learning and reinforcement learning, contributing to international academic community building. His team is one of the earliest in China to focus on robotic learning, proposing the first VLA large model for quadrupedal robots, the VLA large model for humanoid robots, and reward-agnostic human feedback reinforcement learning. His recent collaboration on an AAAI 2026 paper won the Best Paper Award, while the general behavior expert large model GAE he developed also reached an international leading level in humanoid robot movement.

The other corresponding author, Huang Siteng, is currently an algorithm expert at Alibaba’s DAMO Academy. He earned his PhD from the joint training program between Zhejiang University and Westlake University, where he completed his doctoral research at MiLAB under the supervision of Professor Wang Donglin. Prior to this, he obtained his bachelor’s degree in Computer Science from Wuhan University. During his PhD, he engaged in long-term research internships at Alibaba’s Tongyi Laboratory and DAMO Academy, subsequently joining DAMO Academy for algorithm research. His experience spans both academic research and industry practice. Huang’s research focuses primarily on embodied intelligence, multimodal large models, and efficient artificial intelligence, concentrating on how to enable models to understand images, videos, languages, and dynamic information in the physical world simultaneously, as well as perform perception, reasoning, and generation in real environments. His work also emphasizes efficiency optimization in data, computation, and storage, aiming to build unified intelligent systems capable of efficient operation in the real world. He has published over thirty papers in related fields, covering computer vision, multimodal learning, and robotics, and is active in top international conferences and journals. Additionally, he has participated in several research projects in embodied intelligence and multimodal model directions, including vision-language action models and unified world models, with representative works involving HiF-VLA, RynnVLA series, and WorldVLA frameworks, pushing the capabilities of robots in long-sequence tasks and real-world environments.

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/westlake-universitys-novel-approach-to-robot-decision-making-integrating-historical-current-and-future-insights-in-action-models/
