
Imagine a scenario where a robot reaches for a cup on a table, picks it up, then suddenly stops and places it back down, only to try grabbing it again. This repetitive behavior, as if the robot has forgotten its previous actions, is not uncommon in real-world environments. For instance, a button that has already been pressed may be pressed repeatedly, or a drawer that is closed may continue to be pushed. These failures are not due to a lack of visual clarity but stem from the absence of a “world model” that can simulate the evolution of time and space.
Current vision-language-action (VLA) models can understand images and commands, but they often rely solely on the immediate observation when making decisions in continuous tasks. When a task requires multiple steps, such as picking up an object, moving it, placing it down, and then closing a device, repeated actions and interrupted decisions frequently arise. The root of the problem is the model's inability to reason about time, which has become a critical bottleneck in the development of embodied intelligence.
Many existing methods operate on an “act on what you see” basis, performing well in short tasks but struggling with coherence and decision drift in longer sequences. Consequently, enabling models to not only perceive the current state but also remember past actions and anticipate future outcomes presents a new core challenge.
In this context, the team led by Wang Donglin at Westlake University introduced the paper "HiF-VLA: Hindsight, Insight, and Foresight for Vision-Language-Action Models." Rather than relying only on historical images or predictions of future visuals, the work centers on "movement" as the core representation of temporal information. This allows the model to simultaneously model past changes, the current state, and future trends, thereby achieving more stable continuous decision-making.
The significance of this research extends beyond performance improvements; it establishes a new paradigm shifting robots from “passive reactions” to “thinking while acting.” As embodied intelligence gradually transitions into real-world applications, the ability to understand time becomes a key factor determining whether a system is truly usable.
The paper can be accessed at: HiF-VLA Paper.
The LIBERO-Long benchmark tests whether a robot can carry out a series of actions, such as picking up objects, placing them, and closing devices. On it, HiF-VLA achieved a success rate of 94.4% under single-view conditions and 96.4% under multi-view conditions, while the strong baseline OpenVLA-OFT attained 91.0% and 94.0% respectively. That is a 3.4 percentage point improvement in the single-view setting and 2.4 percentage points in the multi-view setting. Across the ten individual tasks, several reached a 100% success rate and the lowest was 76%, indicating broad stability rather than a few outlier tasks inflating the average.
Notably, the performance of this method under single-view conditions is nearing or even matching that of other methods under multi-view conditions, suggesting that the improvement primarily stems from enhanced temporal modeling capabilities rather than an increased reliance on visual information or the number of cameras.
In the CALVIN cross-environment generalization task, models were trained in environments A, B, and C and tested in the unseen environment D; the evaluation metric is the number of consecutive tasks completed without interruption. HiF-VLA achieved 4.08 in single-view and 4.35 in multi-view settings, compared to roughly 4.10 for OpenVLA-OFT and 4.28 for Seer. The multi-view score of 4.35 is thus the highest reported, an improvement of about 0.25 tasks over the baseline. This gain matters because a failure at any intermediate step disqualifies all subsequent tasks, so a higher score reflects stronger stability in long-horizon continuous decision-making and better long-term planning.
Regarding efficiency and computational cost, the study analyzed whether the performance gains came at the expense of computational overhead. Introducing image-based future subgoal prediction yielded a 91.8% success rate but raised latency to 115.9 milliseconds, 1.59 times the baseline. Historical frame stacking fared worse: the success rate dropped to 90.4% while latency climbed to 229.5 milliseconds, 3.15 times the baseline, suggesting that large amounts of raw image data not only incur high computational cost but can also disrupt the model's judgment. In contrast, HiF-VLA with future reasoning alone achieved 92.2% success at 82.7 milliseconds, almost no additional overhead; with historical information alone it also reached 92.2%, at 117.7 milliseconds; and with both it reached 93.2% at 121.6 milliseconds. Overall, the method improved success rates while keeping computational cost well below that of historical frame stacking, indicating that motion information is a more efficient carrier of temporal context than raw historical images.
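The overhead figures above can be cross-checked with a few lines of arithmetic: the reported latencies and multipliers imply a common baseline latency of roughly 72.9 ms. This is a consistency check on the reported numbers, not a figure stated in the paper:

```python
# Reported latencies (ms) and their overhead multipliers vs. the baseline.
# The implied baseline: 115.9 / 1.59 ~= 72.9 ms and 229.5 / 3.15 ~= 72.9 ms,
# so the two reported ratios agree with each other.
reported = {
    "image subgoal prediction": (115.9, 1.59),
    "historical frame stacking": (229.5, 3.15),
}

for name, (latency_ms, multiplier) in reported.items():
    implied_baseline = latency_ms / multiplier
    print(f"{name}: implied baseline ~= {implied_baseline:.1f} ms")

# Against that baseline, the motion-based variants stay cheap:
# future reasoning only (82.7 ms), history only (117.7 ms),
# both combined (121.6 ms) for the best 93.2% success rate.
```

Both reported configurations back out the same baseline, which supports reading the 3.15x figure as latency growth from stacking raw frames rather than measurement noise.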
In terms of scaling with history length, the study increased the history from 4 to 8, then to 16 and 32 steps. Performance peaked at a length of 8, reaching 94.4% in single-view and 96.4% in multi-view conditions; further increases hurt performance because the extra information was largely redundant. On the latency side, traditional methods saw computational cost grow linearly with history length, about 4.5 times higher at a length of 8, whereas this method's latency grew only slightly, demonstrating superior scalability along the temporal dimension.
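The scaling argument can be illustrated with rough token counts. The budgets below (256 image tokens per frame, 8 tokens per motion step) are illustrative assumptions, not figures from the paper, but they show why frame stacking grows linearly while a motion history stays nearly flat:

```python
# Illustrative token budgets (assumed for this sketch, not from the paper).
IMAGE_TOKENS_PER_FRAME = 256   # e.g., a ViT-style patch count for one frame
MOTION_TOKENS_PER_STEP = 8     # a compact motion/change representation

def stacked_frame_tokens(history_len: int) -> int:
    """Frame stacking: every past frame contributes a full set of image tokens."""
    return (1 + history_len) * IMAGE_TOKENS_PER_FRAME  # current frame + history

def motion_history_tokens(history_len: int) -> int:
    """Motion history: one full current frame plus a small token budget per step."""
    return IMAGE_TOKENS_PER_FRAME + history_len * MOTION_TOKENS_PER_STEP

for length in (4, 8, 16, 32):
    print(length, stacked_frame_tokens(length), motion_history_tokens(length))
```

Under these assumptions, a history of 8 costs 2304 tokens with frame stacking but only 320 with motion features, and the gap widens as the history grows, which mirrors the latency behavior the paper reports.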
In real robot experiments, multiple long sequence tasks were set up to verify practical effectiveness. In the button-pressing task, the baseline method recorded a success rate of 17.4%, while this method increased it to 34.2%, nearly doubling the rate. In the covering and stacking task, the baseline achieved 33.3%, while this method reached 57.9%, an improvement of 24.6 percentage points. In the placement task, the baseline was approximately 62.5%, and this method reached around 65%, showing a smaller but more stable improvement. Researchers noted that the baseline method struggled to determine if a button had already been pressed due to subtle state changes, while this method effectively utilized temporal change information to identify state transitions, leading to better performance in more complex tasks. This further underscores that incorporating temporal information significantly enhances a robot’s decision-making capabilities in long sequence tasks.
The systematic comparison of temporal modeling methods was built on carefully organized data and task designs. In simulation, ten long-horizon tasks from the LIBERO dataset and cross-environment generalization tasks from the CALVIN dataset were used. In the real-robot experiments, 100 demonstrations were collected per task, and each task was executed 20 times during testing to evaluate stability and generalization. For the input design, the model simultaneously received three types of information: the current visual input to perceive the present state, historical motion to express past dynamic changes, and language instructions to provide the task goal, enabling joint decisions across the temporal and semantic dimensions.
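The three-stream input described above can be sketched as a simple data interface. All class and field names here are hypothetical placeholders for illustration; the paper's actual implementation may differ:

```python
# Minimal sketch of the three input streams (hypothetical names, not the
# paper's actual interface): current vision + motion history + language goal.
from dataclasses import dataclass
import numpy as np

@dataclass
class TimestepInput:
    current_image: np.ndarray    # H x W x 3, perception of the current state
    motion_history: np.ndarray   # T x D, past dynamic changes as motion features
    instruction: str             # language goal, e.g. "put the cup in the drawer"

def make_input(image, motions, instruction, history_len=8):
    """Bundle one decision step, keeping only the most recent `history_len`
    motion steps (8 was the optimum found in the ablations)."""
    return TimestepInput(
        current_image=image,
        motion_history=np.asarray(motions[-history_len:], dtype=np.float32),
        instruction=instruction,
    )
```

The truncation to the last eight motion steps reflects the paper's finding that longer histories add redundancy rather than useful signal.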
In the comparative experiment design, the research team established various methods for systematic comparison. The first method relied solely on current observational information for decision-making, omitting any temporal data. The second method introduced time information through stacking historical images, though this approach encountered significant redundancy and high computational costs. The third method guided decisions by predicting future images as subgoals, but this method was prone to errors and exhibited less stability. In contrast, the proposed method used motion information to represent temporal changes instead of images, thus reducing redundant information and enhancing modeling efficiency.
The ablation experiments analyzed how individual design choices affected performance. First, varying the history length revealed an optimum of 8: shorter histories provided too little information, while longer ones introduced redundancy that impaired the model's judgment. Second, two strategies for using historical information were compared: feeding it directly into the vision-language model achieved a 92.8% success rate, while injecting it into the decision module raised the rate to 94.4%. This suggests that pushing historical information directly into the vision-language model disrupts its original visual and linguistic processing, whereas introducing it at the decision stage lets it be used more effectively.
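The two injection strategies from the ablation can be sketched schematically. The stand-in modules below are illustrative placeholders, not the paper's architecture; the point is only where the history feature enters the pipeline:

```python
# Schematic of the two wiring variants (stand-in modules, not the real model).
import numpy as np

def vlm_backbone(image_tokens, extra_tokens=None):
    """Stand-in for the vision-language model: pools its input tokens."""
    tokens = image_tokens if extra_tokens is None else np.concatenate(
        [image_tokens, extra_tokens])
    return tokens.mean(axis=0)  # pooled feature vector

def decision_head(feature, history_feature=None):
    """Stand-in for the action/decision module."""
    if history_feature is not None:
        feature = np.concatenate([feature, history_feature])
    return feature  # would be decoded into an action chunk

image_tokens = np.random.randn(16, 32)
history_tokens = np.random.randn(8, 32)

# Variant A (92.8% in the ablation): history enters the VLM itself,
# where it can interfere with visual-language processing.
action_a = decision_head(vlm_backbone(image_tokens, history_tokens))

# Variant B (94.4%): the VLM sees only the current observation;
# history is injected at the decision stage instead.
action_b = decision_head(vlm_backbone(image_tokens),
                         history_feature=history_tokens.mean(axis=0))
```

The only difference between the variants is the injection point, which is exactly the axis the ablation varies.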
This research primarily addresses a core issue: traditional models often rely solely on current observations during decision-making, neglecting temporal information, which leads to incoherent actions and increased failure rates in long sequence tasks. Researchers emphasize that the root of the problem lies not in insufficient visual capabilities but in a lack of time modeling ability. Based on this understanding, the study presents a significant discovery: motion information is more suitable than images for representing temporal changes. This is because images contain a wealth of static information, while motion information retains only the elements that have genuinely changed, making it more efficient and expressive. This finding has a direct impact on robotics research, transforming a previously unidirectional process from perception to action into a decision-making process that considers the past, present, and future simultaneously.
In terms of engineering value, experimental results indicate that this method not only achieves significant performance improvements, with a maximum success rate reaching 96.4%, but also has advantages in computational efficiency, avoiding the potential threefold computational overhead of traditional methods. Additionally, this method demonstrates stronger generalization capabilities in various environments and remains effective in real robot experiments, underscoring its practical application potential.
Furthermore, this research has propelled a new intelligent paradigm, transitioning from “act on what you see” vision-language action models to “think while acting” world action models. HiF-VLA not only alters the structural design of models but also redefines the capabilities that robots should possess. Previous systems functioned more like passive responders, reacting immediately to current inputs; under this new paradigm, robots can now engage in continuous decision-making, remembering what just occurred, determining their current stage, and anticipating what to do next. This shift signifies that robots are no longer limited to executing single-step actions but can comprehend entire processes and continuously adjust their behaviors throughout. This transition indicates that the development of embodied intelligence is moving from “perception-driven reactive systems” to “time-driven reasoning systems.” When models truly possess this capability, robots will be able to operate reliably in complex, dynamic real-world environments, rather than merely completing preset tasks in controlled settings.
The lead author of the paper, Wang Donglin, currently serves as the Deputy Director of the Department of Artificial Intelligence at Westlake University. He is the founder and head of the Machine Intelligence Laboratory (MiLAB) and the founder of Westlake Robotics Technology (Hangzhou) Co., Ltd. He holds bachelor's and master's degrees in Electronic Information Engineering from Xi'an Jiaotong University, earned his Ph.D. in Electrical and Computer Engineering from the University of Calgary in Canada, and completed postdoctoral research in Canada. He later taught at the New York Institute of Technology, rising to associate professor. In 2017, he returned to China to join Westlake University as one of the first full-time faculty members in the College of Engineering and established MiLAB. He also serves as the chief scientist for the National Science and Technology Innovation 2030 Major Project and has been selected for the National High-level Talent Program, playing a pivotal role in national research projects.

His research focuses on robot learning and intelligent decision-making, with an emphasis on reinforcement learning, meta-learning, and intelligent robotic behavior, aiming to enable robots to learn autonomously, adapt quickly to new environments, and complete complex tasks. His work addresses not only perceptual understanding but also the complete feedback loop from perception to decision-making to action, with particular attention to stability in long-horizon tasks and real-world execution. He has published over a hundred papers and is active in cutting-edge fields such as robot learning and reinforcement learning, contributing to the development of international academic communities. His team is among the earliest in China to focus on robot learning, proposing significant models including the first quadrupedal-robot VLA large model and the first humanoid-robot VLA large model.
His recent work presented at AAAI 2026 won the Best Paper Award, and his General Behavior Expert large model (GAE) reached an internationally leading standard in humanoid robot movement.
Another co-author, Huang Siteng, is currently an algorithm expert at Alibaba's DAMO Academy. He obtained his Ph.D. through a joint training program between Zhejiang University and Westlake University, completing his doctoral research in the Machine Intelligence Laboratory under the guidance of Professor Wang Donglin. He earned his undergraduate degree in Computer Science from Wuhan University. During his doctoral studies, he held long-term research internships at Alibaba's Tongyi Lab and DAMO Academy before joining DAMO Academy for algorithm research.

His research focuses on embodied intelligence, multi-modal large models, and efficient artificial intelligence, with a core interest in enabling models to simultaneously understand images, videos, language, and dynamic information from the physical world while achieving perception, reasoning, and generation in real environments. His work spans both multi-modal understanding and generation and emphasizes efficiency optimization in data, computation, and storage, aiming to construct unified intelligent systems that operate efficiently in the real world. He has published over thirty papers covering computer vision, multi-modal learning, and robotics, and is active in top international conferences and journals. He is also involved in multiple research projects related to embodied intelligence and multi-modal models, including vision-language-action models and unified world models, with representative works such as HiF-VLA, the RynnVLA series, and WorldVLA, which have advanced robot capabilities in long-horizon tasks and real-world applications.
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/advancing-robotics-westlake-universitys-team-introduces-hif-vla-for-enhanced-long-sequence-task-performance/
