Hong Kong University of Science and Technology Develops Robot System with Human-Like Vision

A research team at the Hong Kong University of Science and Technology has made a significant breakthrough in robot vision, detailed in a paper presented at the 2026 Conference on Computer Vision and Pattern Recognition (CVPR) and available as arXiv:2603.23478v1. The paper introduces a system called UniFunc3D, which gives robots human-like visual understanding.

Imagine entering an unfamiliar room and being told, “Open the left corner drawer of the cabinet next to the television.” How would you proceed? You would likely scan the room with your eyes, locate the television, identify the nearby cabinet, recognize the left corner section of that cabinet, and finally, find the drawer handle and operate it. This seemingly simple task involves complex visual understanding, spatial reasoning, and functional judgment. The UniFunc3D system is designed to mimic this human cognitive process.

What sets UniFunc3D apart is its ability to not only recognize objects but also understand how to interact with them. The system gives robots a kind of “functional awareness,” allowing them to discern which parts of an object are meant for interaction, such as identifying the handle of a cabinet drawer rather than just seeing the entire cabinet.

Traditional robotic vision systems face fundamental challenges, often acting like “nearsighted” individuals who follow predefined instructions without the flexibility to comprehend complex spatial descriptions and functional requirements. For instance, when given the instruction to “plug in the device behind the left socket,” current systems may confuse the target object, erroneously identifying it as the “device” instead of the “socket.”

The innovation of the UniFunc3D system lies in its implementation of a “scan-then-focus” strategy, akin to human observation habits. When searching for something, humans typically take an initial broad scan of the environment before honing in on potential areas for closer inspection. Similarly, the UniFunc3D system quickly scans a video scene at low resolution to identify likely target areas, then automatically switches to high-resolution mode for precise localization.

Remarkably, this system also features a “self-verification” capability. After identifying a specific area, it acts like a meticulous craftsman, double-checking to ensure the identification is accurate. This design significantly reduces the chances of incorrect identifications, enhancing the reliability of robotic operations.

Addressing the “Visual Blind Spot” of Robots

To fully appreciate the significance of this research, it’s essential to understand the limitations faced by current robotic systems. Most existing robotic vision systems function like an overly specialized factory assembly line, where different “workers” are responsible for different tasks: one interprets textual instructions, another searches for objects in images, and a third makes final operation decisions. While this division of labor seems logical, it poses a critical problem: the first “worker” cannot see the actual scene when interpreting instructions. This is reminiscent of a blind person directing someone to find an object in a room.

For example, when instructed to “plug in the device behind the left socket,” this “blind director” must guess based solely on text, potentially misinterpreting it as the “device” rather than the “socket.” Additionally, these systems are often clumsy in choosing vantage points, relying on rudimentary rules to determine how to observe the scene, such as simply selecting the most centered image of an object.

Another limitation is the lack of “zoom” capabilities. Humans instinctively move closer to examine small objects or squint to focus on details. However, existing systems can only process images at fixed resolutions, rendering them ineffective against tiny functional components that may be represented by just a few pixels in the overall scene.

The research team at the Hong Kong University of Science and Technology has analyzed these issues and traced them back to a common core flaw: the absence of a unified reasoning system equipped with visual perception capabilities. Current methods resemble a team that can communicate only through written notes, leading to inefficiencies and distorted information transfer.

The “Human Vision” Solution of UniFunc3D

In response to these challenges, the UniFunc3D system adopts a fundamentally different approach: it incorporates a visual-capable “brain” to unify all tasks. This intelligent assistant can both see and think, replacing the earlier relay of “blind” specialists. At its core is a multimodal large language model that comprehends both text instructions and visual content. Crucially, it integrates language understanding and visual perception to perform reasoning.

When given a command like “open the left corner drawer of the cabinet containing beauty products,” the system doesn’t make blind guesses; it actively observes the scene to identify the cabinet with beauty products and precisely locates the left corner drawer handle.

The observation strategy mimics human visual habits. When searching for a specific object in complex environments, people typically follow a “scan-focus” pattern. UniFunc3D works similarly: it first conducts multiple quick scans of the entire scene, starting from different time points, akin to viewing a room from various angles. This diverse observational approach ensures no crucial visual cues are overlooked.

During the rapid scanning phase, the system reduces image resolution to enhance processing speed while maintaining a broad field of view. The goal at this stage isn’t to see every detail but to identify general target areas—much like quickly scanning a new room to grasp the overall layout before focusing on a specific item.

Once candidate areas are identified, the system transitions into “focus mode.” It extracts time segments containing potential targets and processes these images at their original high resolution. This mirrors how a person would approach a likely target for a closer look or squint to clarify details. Notably, during high-resolution processing, the system does not “crop” images like traditional methods; it retains the full field of view. This design is crucial, as context from the surrounding environment often helps confirm the target. For example, to locate “the drawer next to the television,” one must see the relative positions of the television and cabinet.
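
To make the “scan-then-focus” idea concrete, here is a minimal Python sketch of a coarse-to-fine loop of this kind. It is an illustration under assumptions rather than the paper’s actual pipeline: `score_frame` is a hypothetical stand-in for a multimodal-model call that rates how relevant a frame is to the query.

```python
import cv2  # pip install opencv-python

def coarse_to_fine_localize(frames, query, score_frame,
                            scale=0.25, offsets=(0, 4, 8, 12), top_k=3):
    """Illustrative "scan-then-focus" loop (hypothetical, not the paper's API).

    frames:      list of H x W x 3 uint8 video frames
    query:       natural-language instruction, e.g. "drawer handle"
    score_frame: stand-in for a multimodal-model call that returns a
                 relevance score for (image, query)
    """
    stride = len(offsets) * 4  # sparse sampling during the coarse pass
    candidates = []

    # Phase 1: coarse scan. Downsample the sampled frames for speed while
    # keeping the full field of view, starting from several temporal offsets.
    for start in offsets:
        for i in range(start, len(frames), stride):
            small = cv2.resize(frames[i], None, fx=scale, fy=scale)
            candidates.append((score_frame(small, query), i))

    # Phase 2: focus. Revisit only the best-scoring frames at the original
    # resolution; note the full view is kept rather than cropped, since
    # surrounding context often helps confirm the target.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(i, score_frame(frames[i], query)) for _, i in candidates[:top_k]]
```

Because only a handful of frames ever reach the high-resolution pass, the number of expensive model calls stays bounded even for long videos.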

Double Verification for Precision

Simply being able to locate a target isn’t sufficient; UniFunc3D also incorporates a sophisticated verification mechanism. This mechanism operates similarly to the “double reading” system used by doctors reviewing X-rays, where two experts independently review the same image to reduce the likelihood of misdiagnosis. After the system preliminarily identifies a target area, it employs a specialized segmentation algorithm to accurately delineate the edges of the target object, ensuring every pixel is correctly assigned.

The next critical step is verification. The system highlights the identified area in vivid colors and then “questions” itself: is this highlighted area truly the functional component I am looking for? It examines this judgment from multiple angles: first, confirming that the marked object is indeed of the correct type, such as a handle rather than a decorative item; and second, ensuring that the area is appropriately sized without including unrelated parts. This self-questioning mechanism is vital, as traditional systems frequently encounter “over-segmentation” issues, mistakenly marking entire drawers when searching for handles. UniFunc3D’s verification process can detect such errors, guaranteeing the accuracy of the final results.
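
The self-questioning step can be sketched in a few lines. This is a hedged illustration assuming a hypothetical `ask_model` helper that poses a yes/no question about an image to a multimodal model; the paper’s actual prompts and interfaces may differ.

```python
import numpy as np

def verify_candidate(image, mask, query, ask_model):
    """Illustrative self-verification check (assumed interface, not the paper's).

    image:     H x W x 3 uint8 frame
    mask:      H x W boolean segmentation of the candidate region
    query:     the functional part being sought, e.g. "handle"
    ask_model: stand-in for a yes/no query to a multimodal model
    """
    # Highlight the candidate region so the model can "see" its own guess
    # (blend toward red; RGB channel order assumed).
    highlighted = image.copy()
    highlighted[mask] = (0.5 * highlighted[mask] +
                         0.5 * np.array([255, 0, 0])).astype(np.uint8)

    # Check 1: is the highlighted region actually the requested part type?
    right_type = ask_model(
        highlighted, f"Is the highlighted region a {query}? Answer yes or no.")

    # Check 2: guard against over-segmentation, e.g. a whole drawer marked
    # when only the handle was requested.
    tight = ask_model(
        highlighted, f"Does the highlight cover only the {query}, without "
                     "unrelated parts? Answer yes or no.")

    return right_type and tight
```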

Impressively, this verification process is fully automated, requiring no human intervention. The system automatically evaluates the quality of its recognition results based on predetermined standards, accepting only those that pass verification. This is akin to having an in-built “quality inspector” constantly monitoring work quality. Through this double assurance mechanism, UniFunc3D significantly enhances recognition accuracy. In practical tests, the system successfully identified complex scenes that traditional methods frequently misinterpret, such as correctly locating a specific cabinet among several similar ones or pinpointing a specific button on a crowded switch panel.

Multi-Angle Fusion for Comprehensive 3D Understanding

Simple 2D image recognition is insufficient for robots, as the real world is three-dimensional. Another innovation of UniFunc3D is its ability to cleverly merge information from multiple 2D perspectives into a complete 3D understanding. The process can be likened to assembling a jigsaw puzzle: each viewpoint contributes one piece that looks incomplete on its own but reveals the full image once everything is put together.

The system collects observations from different time points and angles, then combines this fragmented information like an experienced puzzle master. During the fusion process, a “majority voting” strategy is employed. If a particular 3D spatial point is recognized as part of the target object from multiple different perspectives, its likelihood of being included in the final result increases. This method effectively filters out occasional recognition errors, enhancing the reliability of the overall results.

Considering that the reliability of different perspectives may vary, the system assigns different weights based on the quality of each viewpoint. For instance, if an image from a particular angle is exceptionally clear or contains more contextual information, the recognition result from that angle receives higher importance. This multi-angle fusion strategy proves particularly effective in dealing with partially occluded situations. In real-world environments, target objects are often obscured by other items, making it difficult for a single perspective to capture complete information. However, by synthesizing observations from multiple angles, the system can “navigate around” these obstructions to construct a complete 3D model of the target object.
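
A minimal sketch of such weighted voting follows, under the assumption that each view has already produced a per-point yes/no label and a scalar reliability weight; the array names are illustrative, not the paper’s.

```python
import numpy as np

def fuse_views(point_votes, view_weights, threshold=0.5):
    """Weighted multi-view voting over 3D points (illustrative sketch).

    point_votes:  V x N boolean array; point_votes[v, p] is True when
                  view v labels 3D point p as part of the target
    view_weights: length-V array of per-view reliability weights
                  (e.g. higher for sharper or more contextual views)
    Returns a length-N boolean mask over the 3D points.
    """
    w = np.asarray(view_weights, dtype=float)
    votes = np.asarray(point_votes, dtype=float)

    # Weighted fraction of views that voted for each point.
    support = (w[:, None] * votes).sum(axis=0) / w.sum()

    # A point survives only if enough (weighted) views agree, which
    # filters out one-off recognition errors from single perspectives.
    return support >= threshold
```

With equal weights this reduces to plain majority voting; giving a clear, context-rich view a higher weight simply lets it count for more than a blurry one.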

Ultimately, the system outputs an accurate 3D segmentation result, clearly indicating which 3D spatial points belong to the target functional component. This result can be directly utilized for robotic path planning and action execution, facilitating genuine intelligent operations.

Experimental Results Validate Outstanding Performance

To validate the real-world effectiveness of UniFunc3D, the research team conducted comprehensive tests on the SceneFun3D dataset, which includes 230 high-resolution real indoor scenes and over 3,000 complex functional operation tasks, making it one of the most challenging benchmarks in the field. The experimental results were impressive. Compared to the best existing zero-training method, Fun3DU, UniFunc3D showed significant improvements in key metrics. Under the stringent AP50 metric, the relative improvement reached 84.9%, meaning the system’s recognition accuracy nearly doubled under strict standards. Under the more relaxed AP25 metric, the improvement was a still-notable 53.2%. UniFunc3D also achieved a relative improvement of 59.9% in mean Intersection over Union (mIoU), which measures the overlap between recognized areas and true target areas; a high score reflects the system’s ability to accurately determine boundaries.
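
For readers unfamiliar with these metrics: IoU compares a predicted region P against the ground-truth region G, and AP50 and AP25 count a prediction as correct when its IoU clears 0.5 or 0.25, respectively. In standard notation:

```latex
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|},
\qquad
\mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}(P_k, G_k)
```

where K is the number of evaluated tasks.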

The advantages of UniFunc3D are even more pronounced when compared to methods requiring extensive training data. While those trained systems may have been optimized over extended periods for specific datasets, UniFunc3D consistently surpasses them in most metrics, demonstrating the superiority of its unified architecture design. Sometimes, a good design is more crucial than a vast amount of training data. UniFunc3D excels in handling challenging scenarios, such as the task of “opening the left corner drawer of the cabinet containing beauty products,” where the system must first recognize which cabinet holds the beauty products, then accurately locate the left corner position, and finally find the drawer handle. Traditional methods often falter in these complex spatial reasoning tasks, misidentifying cabinets or confusing directions. In contrast, UniFunc3D can reliably complete these tasks, showcasing an understanding ability close to that of humans.

Moreover, the system performs exceptionally well when dealing with small functional components. Many practical operational targets are small, like switch buttons, socket holes, or small handles, which may occupy a tiny proportion of the overall scene. Traditional methods frequently struggle to accurately identify these minute targets, but UniFunc3D’s “zoom” mechanism effectively manages such challenges.

Efficiency Advantages Significantly Enhanced Practicality

In addition to accuracy improvements, UniFunc3D shows significant advantages in processing efficiency. Under the same hardware conditions, the system runs 3.2 times faster than the best existing methods, reducing the processing time for each scene from 82 minutes to just 26 minutes. This efficiency boost stems from the ingenious design of the system. Traditional methods require running multiple different models, each needing to be loaded and executed separately, much like launching various applications to complete a single task. UniFunc3D, by contrast, only needs to run one unified model, eliminating the overhead of model switching and data transfer.

More importantly, the system’s “coarse-to-fine” strategy significantly reduces the number of images that require high-resolution processing. During the rough scanning phase, the system quickly locates candidate areas using lower resolution and only switches to high-resolution processing after confirming the target location. This approach avoids the massive overhead of processing all images at full resolution. The system further enhances efficiency through intelligent time window selection, analyzing only the most informative frames based on content changes in the video. This is akin to an experienced photographer knowing the right moment to press the shutter; the system can identify the most valuable observations. Such efficiency improvements are crucial for practical applications, as response speed can be just as important as accuracy in real robotic systems. Users do not want to wait over an hour to see results after issuing commands to a robot. The high efficiency of UniFunc3D enables real-time or near-real-time applications, greatly enhancing user experience.
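
The description of intelligent time-window selection suggests a simple change-based heuristic. The sketch below illustrates one plausible version; the scoring rule is an illustrative assumption, not the paper’s exact criterion.

```python
import cv2
import numpy as np

def select_informative_frames(frames, budget=8):
    """Pick frames after the largest content changes (illustrative heuristic).

    frames: list of H x W x 3 uint8 frames (BGR order, as in OpenCV)
    budget: how many frames to keep for expensive high-resolution processing
    """
    # Score each frame by how much it differs from its predecessor;
    # static stretches of video contribute little new information.
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(float) for f in frames]
    change = [np.abs(grays[i] - grays[i - 1]).mean()
              for i in range(1, len(grays))]

    # Keep the frames that follow the largest changes; change[i] compares
    # frame i+1 to frame i, hence the +1 offset below.
    order = np.argsort(change)[::-1][:budget]
    return sorted(int(i) + 1 for i in order)
```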

The Clever Design of the System

The success of UniFunc3D heavily relies on its clever system design. Unlike traditional “modular” methods, this system adopts an “integrated” design philosophy, akin to sculpting an artwork from a single piece of wood rather than gluing different parts together. The core of the system is a meticulously designed reasoning chain. Upon receiving a task instruction, the system does not simply decompose it into independent subtasks; it maintains an understanding of the overall goal at each step. This design avoids the common “error accumulation” issue found in traditional methods, where small mistakes in earlier steps are magnified in subsequent processing.

In processing multimodal information, the system employs an “interwoven” fusion strategy. Textual and visual information are not handled separately and merged later; rather, they interact deeply at every stage of processing. It resembles two experienced detectives discussing clues while simultaneously observing the scene, rather than one person being responsible for observation and the other for reasoning. The system also possesses powerful adaptability; it can automatically adjust its processing strategies according to different types of tasks and scene complexities. For simpler tasks, the system may converge faster to results, while for more complex scenes, it will automatically increase the number of observation angles and detail levels.

Notably, the entire system is entirely “training-free,” meaning it requires no additional training or tuning for specific tasks. This design significantly lowers the deployment threshold, facilitating application in various scenarios. Users do not need to prepare extensive training data or conduct complex model optimizations to achieve excellent performance.

A Thorough Analysis of Component Contributions

To better understand the reasons behind UniFunc3D’s success, the research team conducted detailed component analysis experiments. These experiments are akin to disassembling a precision machine and examining each part’s role to comprehend the source of overall performance. Firstly, the team validated the advantages of “two-stage processing” over “single-stage processing.” Experimental results showed that directly using high resolution for all images, while capturing more details, actually led to poorer outcomes. This is because single-stage methods struggle to effectively process long sequence information and lack global vision guidance, easily getting lost in details. In contrast, the “low resolution first, high resolution later” two-stage strategy showed remarkable performance. In the low-resolution phase, the system quickly gains a global understanding, determining general target areas. Then, in the high-resolution phase, it can concentrate on these candidate regions, ensuring clarity in details while avoiding information overload.

The importance of multiple sampling strategies was also validated by experiments. Performance declines significantly if only one observation is conducted since a single viewpoint may miss crucial information. By performing multiple samples starting from different temporal offsets, the system can comprehensively cover the entire scene, greatly increasing the probability of locating the target. The time window processing mechanism yielded the most significant performance improvements. When the system expanded from single-frame processing to multi-frame time window processing, the AP50 metric improved by over five percentage points, and the AP25 metric increased by more than ten percentage points. This proves that temporal context information is vital for accurately understanding the functional aspects of 3D scenes. The role of the verification mechanism is equally crucial. By visually inspecting recognition results, the system can filter out many incorrect candidate results. This is particularly evident when there are numerous candidates, as the verification mechanism can accurately select the correct target from many options.

Interestingly, the performance improvement was most significant when the number of samples increased from one to two. Continuing to four samples brought further gains, but with diminishing returns, and increasing to eight yielded only minimal improvement. This finding provides essential insight for practical deployment: four samples represent the best balance between effectiveness and efficiency.

Future Directions for Technological Development

While UniFunc3D has achieved remarkable results, the research team is aware of existing technological limitations. The system still faces challenges with extremely small functional components (occupying less than 0.1% of the image area) or severely occluded scenes. Future research may develop along several directions to address these challenges. One avenue is the creation of more intelligent “zoom” mechanisms. Although the current system can switch between different resolutions, this switching is relatively straightforward. Future developments may introduce more sophisticated attention mechanisms that allow for ultra-high-resolution processing of critical areas while maintaining a global view.

Another promising direction is the direct integration of explicit 3D geometric reasoning into the system. Current methods primarily rely on processing 2D images and then obtaining 3D understanding through multi-view fusion. Future systems may perform reasoning directly in 3D space, enabling them to handle complex spatial relationships and geometric constraints more directly. Interactive improvements also present an essential development direction. The current system is “one-off,” completing tasks once results are provided. However, in practical applications, users may need to fine-tune results or provide additional guidance. Future systems may support interactive refinement processes, allowing users to enhance recognition results through simple feedback.

Expanding to more diverse scenarios is also a vital research direction. Current studies focus primarily on indoor environments, yet robots can operate in many other contexts. Outdoor environments, industrial settings, and medical environments present unique challenges and demands. Adapting similar technologies to these various application scenarios is a question worth exploring.

Ultimately, UniFunc3D represents a significant milestone in the realm of robotic visual understanding. It not only achieves technical breakthroughs but also demonstrates a groundbreaking design philosophy: utilizing a unified, visually perceptive intelligent system to tackle complex multimodal tasks. This philosophy may influence the technological development of many other fields in the future.

At its core, this research reveals an important direction for the evolution of robot intelligence. Future robots will not only execute commands but also comprehend complex environments and task requirements like humans. UniFunc3D has taken a substantial step in this direction, showcasing the potential for robots to possess “human-like vision.” For the general public, this implies that future smart homes and service robots will become significantly more intelligent and practical, capable of understanding more complex instructions and accurately executing a variety of detailed tasks. Interested readers can find the complete technical details through the paper ID arXiv:2603.23478v1 or follow subsequent research developments from the relevant laboratories at the Hong Kong University of Science and Technology.

Q&A

Q1: How does the UniFunc3D system work?
A: UniFunc3D employs a “scan-then-focus” strategy similar to human observation habits, first using low resolution to quickly scan the entire scene for general target areas, then switching to high resolution for precise localization, and finally ensuring the accuracy of recognition results through a self-verification mechanism. The entire process is managed by a unified multimodal large language model, avoiding information loss common in traditional methods involving multiple systems.

Q2: What advantages does UniFunc3D have compared to existing methods?
A: On the stringent AP50 metric, UniFunc3D achieves a relative accuracy improvement of 84.9% over the best existing zero-training method and runs 3.2 times faster, even surpassing specialized methods that require extensive training data on most metrics. Notably, it can understand complex spatial descriptions, accurately locating targets like “the left corner drawer of the cabinet next to the television,” which requires composite reasoning.

Q3: When can UniFunc3D technology be applied in daily life?
A: Although the technology has demonstrated excellent performance in experimental settings, further engineering development is needed for practical application in household robots. Nevertheless, this research points the way for the evolution of smart homes and service robots, paving the path for future robots to better understand and execute complex household instructions.

Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/hong-kong-university-of-science-and-technology-develops-robot-system-with-human-like-vision/
