
Imagine a 1.3-meter-tall humanoid robot that can walk like a human, accurately place pillows in an unfamiliar bedroom, assist in a convenience store by putting toys into a shopping cart, and even throw trash into a bin in an outdoor garden. While this may sound like a scene from a science fiction movie, recent research is making it a reality. A collaborative research team from the University of Hong Kong, Shanghai Innovation Research Institute, Beihang University, and Kinetix AI has proposed the EgoHumanoid framework. This framework utilizes the PICO VR headset to record human demonstrations in both indoor and outdoor environments in a low-cost and unconstrained manner. It systematically explores how egocentric (first-person) videos of everyday human activities can be used to train humanoid robots to perform complex whole-body loco-manipulation tasks.
Traditionally, training humanoid robots involves researchers repeatedly demonstrating tasks via teleoperation in a confined laboratory setting. This approach is not only expensive and complex but also has significant limitations: due to hardware and safety restrictions, it is difficult to collect robot data in diverse real-world settings such as homes, stores, and parks. As a result, these robots often struggle to adapt when placed in unfamiliar environments, exhibiting poor generalization.
The core insight of EgoHumanoid is that humans naturally perform movement tasks in the very environments where robots are expected to operate daily. Why not let robots learn directly from humans? Using the PICO headset and PICO trackers, the research team can easily record human behavior in a variety of settings. A few minutes of video showing a human throwing trash in a park or arranging pillows in a bedroom provides valuable learning material for the robot. This method of data collection is more than twice as efficient as traditional teleoperation.
However, making robots imitate human actions is not straightforward. Humans and robots differ significantly in arm length, body proportions, viewpoint height, and walking posture, so human movements cannot simply be copied one-to-one. To address this, the EgoHumanoid team designed an alignment process with two main components (a simplified sketch follows the list):
- Viewpoint Alignment: Using depth estimation and image generation techniques, the framework converts first-person images captured from the human's viewpoint into images as they would appear from the robot's camera.
- Action Alignment: This involves mapping complex human movements onto a unified action space that the robot can execute.
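
To make these two steps more concrete, here is a minimal, hypothetical sketch in Python/NumPy. The camera heights, intrinsics, arm-length ratio, and function names (`align_viewpoint`, `align_action`) are illustrative assumptions, not the paper's actual implementation; the real system relies on learned depth estimation and image generation rather than the purely geometric re-projection shown here.

```python
# Hypothetical sketch of the two alignment steps described above.
# All constants and function names are illustrative assumptions.
import numpy as np

HUMAN_EYE_HEIGHT = 1.65   # metres, assumed height of the human demonstrator's camera
ROBOT_CAM_HEIGHT = 1.20   # metres, assumed head-camera height of the robot

def align_viewpoint(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Re-render a human egocentric frame as if seen from the robot's camera.

    A real system would use a learned depth estimator plus an image
    generation model; here we only lower the viewpoint by the height
    difference and re-project, as a simple geometric stand-in.
    """
    h, w = depth.shape
    fy = 0.5 * w                      # assumed pinhole intrinsics
    cy = h / 2.0
    v, u = np.mgrid[0:h, 0:w]
    z = np.maximum(depth, 1e-6)
    # Back-project the vertical pixel coordinate to a 3D height offset.
    y = (v - cy) * z / fy
    # Lower the viewpoint from human eye height to robot camera height.
    y = y - (HUMAN_EYE_HEIGHT - ROBOT_CAM_HEIGHT)
    # Re-project into the robot viewpoint (horizontal coordinate unchanged).
    v2 = np.clip((y * fy / z + cy).astype(int), 0, h - 1)
    out = np.zeros_like(rgb)
    out[v2, u] = rgb[v, u]
    return out

def align_action(human_wrist_pos: np.ndarray, human_base_vel: np.ndarray) -> np.ndarray:
    """Map a human motion sample into a unified action vector.

    Scales end-effector targets by an assumed arm-length ratio and
    concatenates them with base velocity commands, so human and robot
    data share one action space.
    """
    ARM_SCALE = 0.75                  # assumed robot/human arm-length ratio
    ee_target = human_wrist_pos * ARM_SCALE
    return np.concatenate([ee_target.ravel(), human_base_vel])
```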
Once human observations and actions are aligned in this way, a single Vision-Language-Action (VLA) model can be co-trained on a large amount of human data together with a smaller amount of robot data.
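
The co-training itself can be pictured as mixing batches from both data sources into one policy-update loop. The PyTorch sketch below is an assumption-laden illustration: the dataset sizes, the 4:1 over-sampling of robot data, and the tiny stand-in network are placeholders, not the paper's reported configuration.

```python
# Minimal co-training sketch; dataset sizes, sampling weights, and the
# stand-in policy network are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Aligned human data (large) and robot data (small), both already mapped
# into the same observation / action space by the alignment steps above.
human_ds = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 16))
robot_ds = TensorDataset(torch.randn(1_000, 128), torch.randn(1_000, 16))
dataset = ConcatDataset([human_ds, robot_ds])

# Over-sample the scarce robot data so every batch mixes both sources.
weights = torch.cat([torch.full((len(human_ds),), 1.0),
                     torch.full((len(robot_ds),), 4.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Stand-in for the VLA policy: observation embedding -> action vector.
policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 16))
optim = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for obs, action in loader:                # one epoch of mixed batches
    loss = nn.functional.mse_loss(policy(obs), action)
    optim.zero_grad()
    loss.backward()
    optim.step()
```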
In practical tests on the real Unitree G1 humanoid robot, EgoHumanoid was validated on four challenging movement tasks (placing pillows, handling trash, transferring toys, and loading shopping carts). The results showed that compared to a baseline model trained solely on robot data, EgoHumanoid not only performed better in familiar laboratory environments but also exhibited a remarkable 51% improvement in entirely new settings, such as actual bedrooms and convenience stores.
For more information, you can access the relevant paper: EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration.
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/egohumanoid-framework-enhances-robot-generalization-using-pico-technology/
