
2025: Zhejiang University Team Releases Open Source π³ Model – A Breakthrough in Visual Geometry Reconstruction, Outperforming VGGT!
On February 24, 2026, an article was published highlighting work by researchers from Shanghai AI Lab and Zhejiang University. They introduced π³, a feedforward neural network that takes a novel approach to visual geometry reconstruction, breaking away from the reliance on a traditional fixed reference view.
Introduction
Previous methods typically anchor reconstruction results to specified viewpoints. This inductive bias can cause instability and reconstruction failures when the reference views are suboptimal. In contrast, π³ employs a completely permutation-equivariant architecture, which allows it to predict affine-invariant camera poses and scale-invariant local point maps without any reference frame. This design grants the model robustness to input order and exceptional scalability. As a result, this straightforward and unbiased approach achieves state-of-the-art performance across various tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction.
Research Background and Related Work
Research Background
Visual geometry reconstruction is widely applied in fields such as augmented reality and robotics. Traditional methods rely on iterative optimization processes. Although advancements have been made with feedforward neural networks (e.g., DUSt3R), both traditional and modern methods depend on fixed reference views, which limits model performance and robustness. Poor reference view selection can lead to decreased reconstruction quality.
Related Work
Traditional 3D Reconstruction
Classic methods like Structure from Motion (SfM) and Multi-View Stereo (MVS) establish feature correspondences through multi-view geometry, estimating camera poses and generating 3D point clouds. However, they rely on complex multi-stage processes and time-consuming iterative optimizations (e.g., Bundle Adjustment).
Feedforward 3D Reconstruction
Feedforward models directly regress 3D structure from images. DUSt3R handles image pairs but requires a separate global-alignment stage to scale to more views. Fast3R supports multi-image inference, FLARE decomposes the task into sub-problems, and VGGT uses multi-task learning; however, all of these methods anchor the 3D structure to a reference frame.
Methodology
Permutation-Equivariant Architecture
The core design goal is to ensure that model outputs are unaffected by the input view order, achieving robustness to input permutation. Given an input sequence of N images, S = (I1, …, IN), the network φ maps this sequence to output tuples.
The outputs include camera poses Ti, pixel-aligned 3D point maps Xi, and confidence maps Ci. The architecture guarantees that for any permutation operator Pπ, the relationship φ(Pπ(S)) = Pπ(φ(S)) holds, ensuring a one-to-one correspondence between images and outputs.
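The equivariance property φ(Pπ(S)) = Pπ(φ(S)) can be checked numerically. The toy "network" below is a per-view linear layer (a hypothetical stand-in, not the actual Pi3 architecture, which mixes views with attention while still preserving this property): permuting the input views must permute the outputs in exactly the same way.

```python
import torch

# Toy stand-in for the network phi: a per-view linear map, which is trivially
# permutation-equivariant because each view is processed independently.
phi = torch.nn.Linear(16, 8)

S = torch.randn(4, 16)            # N = 4 views, toy 16-dim features each
perm = torch.randperm(4)          # a permutation operator P_pi

out_then_perm = phi(S)[perm]      # P_pi(phi(S))
perm_then_out = phi(S[perm])      # phi(P_pi(S))

# Equivariance: permuting inputs permutes outputs identically
assert torch.allclose(out_then_perm, perm_then_out, atol=1e-6)
```

For a real attention-based model the same check applies, since attention without positional encodings over the view axis treats the views as an unordered set.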
Scale-Invariant Local Geometry
For each image Ii, the model predicts 3D point maps Xi defined in its local coordinate system, sharing an unknown yet consistent scale factor across all images to address monocular reconstruction’s scale ambiguity.
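Because the point maps are only defined up to a shared scale, comparing a prediction against a reference requires solving for that scale first. A minimal sketch (the alignment scheme here is an illustrative least-squares fit, not necessarily the paper's exact evaluation protocol):

```python
import torch

# Scale-invariant comparison of a predicted point map against a reference:
# solve for the single scale factor s that best aligns them (least squares),
# then measure the residual. Shapes are illustrative (H x W x 3).
def scale_aligned_error(pred, ref):
    # closed-form least-squares scale: s = <pred, ref> / <pred, pred>
    s = (pred * ref).sum() / (pred * pred).sum()
    return s, (s * pred - ref).norm() / ref.norm()

ref = torch.randn(480, 640, 3)     # reference point map
pred = ref / 2.5                   # same geometry at an unknown scale
s, err = scale_aligned_error(pred, ref)
print(s.item(), err.item())        # s recovers 2.5, residual ~ 0
```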
Affine-Invariant Camera Pose
The camera poses Ti are defined only up to a similarity transformation. Supervising relative poses removes the global reference-frame ambiguity, with the relative pose computed as Ti←j = Ti⁻¹Tj.
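The key property of Ti←j = Ti⁻¹Tj is that it is invariant to any rigid change of the global frame: if every pose is left-multiplied by the same transform G, the relative pose is unchanged. A short check with 4×4 rigid transforms:

```python
import math
import torch

def se3(R, t):
    # assemble a 4x4 rigid transform from rotation R (3x3) and translation t (3,)
    T = torch.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

Ti = se3(rot_z(0.3), torch.tensor([1., 0., 2.]))
Tj = se3(rot_z(1.1), torch.tensor([0., -1., 3.]))
G  = se3(rot_z(0.7), torch.tensor([5., 5., 5.]))   # arbitrary global frame change

rel       = torch.linalg.inv(Ti) @ Tj              # T_{i<-j} = Ti^-1 Tj
rel_moved = torch.linalg.inv(G @ Ti) @ (G @ Tj)    # same poses, new global frame

assert torch.allclose(rel, rel_moved, atol=1e-5)   # relative pose is unchanged
```

This is why supervising relative poses, rather than absolute ones, lets the model dispense with a designated reference frame.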
Model Training
The composite loss function integrates multiple task losses to optimize performance, trained on 15 diverse datasets covering indoor and outdoor scenes, including GTASfM, CO3D, and ScanNet.
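A composite loss of this kind typically combines a confidence-weighted point-map regression term with a pose term. The sketch below is hypothetical (the exact Pi3 loss terms and weights are not given in this article); the log-confidence regularizer is the standard device that keeps the model from driving all confidences to zero:

```python
import torch

# Hypothetical composite loss: confidence-weighted L1 on point maps plus a
# pose term. Weights alpha/beta are illustrative, not the paper's values.
def composite_loss(pred_pts, gt_pts, conf, pose_err, alpha=0.2, beta=1.0):
    # per-pixel L1 weighted by confidence, with a log-confidence regularizer
    pt_term = (conf * (pred_pts - gt_pts).abs().sum(-1) - alpha * conf.log()).mean()
    return pt_term + beta * pose_err

pred = torch.randn(2, 64, 64, 3, requires_grad=True)  # B x H x W x 3 points
gt   = torch.randn(2, 64, 64, 3)
conf = torch.rand(2, 64, 64) + 0.5                    # positive confidences
loss = composite_loss(pred, gt, conf, pose_err=torch.tensor(0.1))
loss.backward()                                       # differentiable end to end
```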
Experimental Results
Camera Pose Estimation
Evaluations were conducted on multiple datasets, assessing angle accuracy and distance error. Results showed that on the RealEstate10K dataset, π³ achieved RRA@30 of 99.99 and RTA@30 of 95.62, establishing a new state-of-the-art for zero-shot generalization.
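The RRA@30 and RTA@30 numbers reported above are threshold accuracies: the percentage of image pairs whose relative rotation error (RRA) or translation-direction error (RTA) falls below 30 degrees. A minimal sketch with synthetic per-pair errors (the error values here are made up for illustration):

```python
import torch

# RRA@t / RTA@t: fraction of image pairs whose angular error is below t degrees,
# reported as a percentage.
def accuracy_at(err_deg, threshold=30.0):
    return (err_deg < threshold).float().mean().item() * 100

rot_err   = torch.tensor([0.4, 2.1, 55.0, 1.3])   # per-pair rotation error (deg)
trans_err = torch.tensor([5.0, 40.0, 3.2, 12.0])  # per-pair translation error (deg)
print(f"RRA@30 = {accuracy_at(rot_err):.2f}")     # 3 of 4 pairs pass -> 75.00
print(f"RTA@30 = {accuracy_at(trans_err):.2f}")   # 3 of 4 pairs pass -> 75.00
```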
Point Map Estimation
The quality of reconstruction point maps was assessed across various conditions. In object-level and scene-level datasets, π³ outperformed VGGT, with notable improvements in accuracy and completeness metrics.
Video Depth Estimation
In experiments with datasets like Sintel and KITTI, π³ consistently achieved new state-of-the-art performance across all settings, demonstrating superior inference speed and lower error rates compared to prior methods.
Robustness Evaluation
Robustness to input sequence permutations was evaluated, revealing that π³ maintained near-zero standard deviation across all metrics, confirming the effectiveness of its permutation-equivariant design.
Conclusion and Future Work
Conclusion
The proposed π³ model is a fully permutation-equivariant feedforward neural network that predicts affine-invariant camera poses and scale-invariant local point maps, freeing visual geometry reconstruction from the constraints of fixed reference views. It has shown state-of-the-art performance in various tasks across multiple datasets.
Future Work
Future optimizations could address the model’s limitations, such as improving handling of transparent objects, enhancing geometric reconstruction details, and reducing artifacts in point cloud generation. Additionally, exploring its application in dynamic scenes and complex environments could further extend its practical value in areas like augmented reality and robotic navigation.
Quick Start
Getting Started
Clone the repository and install the required dependencies:
git clone https://github.com/yyfz/Pi3.git
cd Pi3
pip install -r requirements.txt
Run inference using the example scripts, which support processing image directories or video files. If the automatic model checkpoint download is slow, downloading the weights manually is suggested.
Detailed Usage Instructions
The model accepts image tensors as input and outputs a dictionary containing reconstructed geometric information. Each input tensor should have a shape of B×N×3×H×W, where B is the batch size, N is the number of images, and H and W are the height and width of the images.
Example Code Snippet
import torch
from pi3.models.pi3 import Pi3
from pi3.utils.basic import load_images_as_tensor

# Select GPU if available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pretrained Pi3 weights from the Hugging Face Hub
model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()

# Load images from the input path (sampling every `interval`-th frame)
# into an N x 3 x H x W tensor
imgs = load_images_as_tensor('your_data_path', interval=10).to(device)

with torch.no_grad():
    results = model(imgs[None])  # add the batch dimension: 1 x N x 3 x H x W

print("Reconstruction completed!")
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/breakthrough-in-visual-geometry-reconstruction-zhejiang-university-team-releases-%cf%80%c2%b3-model-outperforming-vggt/
