
2025: Zhejiang University Team Releases Open Source π³ Model – A Breakthrough in Visual Geometry Reconstruction, Outperforming VGGT!
On February 24, 2026, an article was published highlighting work by researchers from Shanghai AI Lab and Zhejiang University. They introduced π³, a feedforward neural network that takes a novel approach to visual geometry reconstruction, breaking away from the reliance on a traditional fixed reference view.
Introduction
Previous methods typically anchor reconstruction results to specified viewpoints. This inductive bias can cause instability and reconstruction failures when the reference views are suboptimal. In contrast, π³ employs a completely permutation-equivariant architecture, which allows it to predict affine-invariant camera poses and scale-invariant local point maps without any reference frame. This design grants the model robustness to input order and exceptional scalability. As a result, this straightforward and unbiased approach achieves state-of-the-art performance across various tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction.
Research Background and Related Work
Research Background
Visual geometry reconstruction is widely applied in fields such as augmented reality and robotics. Traditional methods rely on iterative optimization processes. Although advancements have been made with feedforward neural networks (e.g., DUSt3R), both traditional and modern methods depend on fixed reference views, which limits model performance and robustness. Poor reference view selection can lead to decreased reconstruction quality.
Related Work
Traditional 3D Reconstruction
Classic methods like Structure from Motion (SfM) and Multi-View Stereo (MVS) establish feature correspondences through multi-view geometry, estimating camera poses and generating 3D point clouds. However, they rely on complex multi-stage processes and time-consuming iterative optimizations (e.g., Bundle Adjustment).
Feedforward 3D Reconstruction
Feedforward models directly regress 3D structure from images. DUSt3R handles image pairs but requires a separate global-alignment stage to scale to more views. Fast3R supports multi-image inference, FLARE decomposes the task into sub-problems, and VGGT uses multi-task learning; however, all of these methods anchor the 3D structure to a reference frame.
Methodology
Permutation-Equivariant Architecture
The core design goal is to ensure that model outputs are unaffected by the input view order, achieving robustness to input permutation. Given an input sequence of N images, S = (I1, …, IN), the network φ maps this sequence to output tuples.
The outputs include camera poses Ti, pixel-aligned 3D point maps Xi, and confidence maps Ci. The architecture guarantees that for any permutation operator Pπ, the relationship φ(Pπ(S)) = Pπ(φ(S)) holds, ensuring a one-to-one correspondence between images and outputs.
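The equivariance property φ(Pπ(S)) = Pπ(φ(S)) can be checked numerically. The toy "network" below is a per-view linear layer (a hypothetical stand-in, not the actual Pi3 architecture, which mixes views with attention while still preserving this property): permuting the input views must permute the outputs in exactly the same way.

```python
import torch

# Toy stand-in for the network phi: a per-view linear map, which is trivially
# permutation-equivariant because each view is processed independently.
phi = torch.nn.Linear(16, 8)

S = torch.randn(4, 16)            # N = 4 views, toy 16-dim features each
perm = torch.randperm(4)          # a permutation operator P_pi

out_then_perm = phi(S)[perm]      # P_pi(phi(S))
perm_then_out = phi(S[perm])      # phi(P_pi(S))

# Equivariance: permuting inputs permutes outputs identically
assert torch.allclose(out_then_perm, perm_then_out, atol=1e-6)
```

For a real attention-based model the same check applies, since attention without positional encodings over the view axis treats the views as an unordered set.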
Scale-Invariant Local Geometry
For each image Ii, the model predicts 3D point maps Xi defined in its local coordinate system, sharing an unknown yet consistent scale factor across all images to address monocular reconstruction’s scale ambiguity.
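Because the point maps are only defined up to a shared scale, comparing a prediction against a reference requires solving for that scale first. A minimal sketch (the alignment scheme here is an illustrative least-squares fit, not necessarily the paper's exact evaluation protocol):

```python
import torch

# Scale-invariant comparison of a predicted point map against a reference:
# solve for the single scale factor s that best aligns them (least squares),
# then measure the residual. Shapes are illustrative (H x W x 3).
def scale_aligned_error(pred, ref):
    # closed-form least-squares scale: s = <pred, ref> / <pred, pred>
    s = (pred * ref).sum() / (pred * pred).sum()
    return s, (s * pred - ref).norm() / ref.norm()

ref = torch.randn(480, 640, 3)     # reference point map
pred = ref / 2.5                   # same geometry at an unknown scale
s, err = scale_aligned_error(pred, ref)
print(s.item(), err.item())        # s recovers 2.5, residual ~ 0
```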
Affine-Invariant Camera Pose
The camera poses Ti are defined only up to a similarity transformation. Supervising relative poses removes the global reference-frame ambiguity, with the relative pose computed as Ti←j = Ti⁻¹Tj.
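The key property of Ti←j = Ti⁻¹Tj is that it is invariant to any rigid change of the global frame: if every pose is left-multiplied by the same transform G, the relative pose is unchanged. A short check with 4×4 rigid transforms:

```python
import math
import torch

def se3(R, t):
    # assemble a 4x4 rigid transform from rotation R (3x3) and translation t (3,)
    T = torch.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

Ti = se3(rot_z(0.3), torch.tensor([1., 0., 2.]))
Tj = se3(rot_z(1.1), torch.tensor([0., -1., 3.]))
G  = se3(rot_z(0.7), torch.tensor([5., 5., 5.]))   # arbitrary global frame change

rel       = torch.linalg.inv(Ti) @ Tj              # T_{i<-j} = Ti^-1 Tj
rel_moved = torch.linalg.inv(G @ Ti) @ (G @ Tj)    # same poses, new global frame

assert torch.allclose(rel, rel_moved, atol=1e-5)   # relative pose is unchanged
```

This is why supervising relative poses, rather than absolute ones, lets the model dispense with a designated reference frame.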
Model Training
The composite loss function integrates multiple task losses to optimize performance, trained on 15 diverse datasets covering indoor and outdoor scenes, including GTASfM, CO3D, and ScanNet.
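A composite loss of this kind typically combines a confidence-weighted point-map regression term with a pose term. The sketch below is hypothetical (the exact Pi3 loss terms and weights are not given in this article); the log-confidence regularizer is the standard device that keeps the model from driving all confidences to zero:

```python
import torch

# Hypothetical composite loss: confidence-weighted L1 on point maps plus a
# pose term. Weights alpha/beta are illustrative, not the paper's values.
def composite_loss(pred_pts, gt_pts, conf, pose_err, alpha=0.2, beta=1.0):
    # per-pixel L1 weighted by confidence, with a log-confidence regularizer
    pt_term = (conf * (pred_pts - gt_pts).abs().sum(-1) - alpha * conf.log()).mean()
    return pt_term + beta * pose_err

pred = torch.randn(2, 64, 64, 3, requires_grad=True)  # B x H x W x 3 points
gt   = torch.randn(2, 64, 64, 3)
conf = torch.rand(2, 64, 64) + 0.5                    # positive confidences
loss = composite_loss(pred, gt, conf, pose_err=torch.tensor(0.1))
loss.backward()                                       # differentiable end to end
```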
Experimental Results
Camera Pose Estimation
Evaluations were conducted on multiple datasets, assessing angle accuracy and distance error. Results showed that on the RealEstate10K dataset, π³ achieved RRA@30 of 99.99 and RTA@30 of 95.62, establishing a new state-of-the-art for zero-shot generalization.
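The RRA@30 and RTA@30 numbers reported above are threshold accuracies: the percentage of image pairs whose relative rotation error (RRA) or translation-direction error (RTA) falls below 30 degrees. A minimal sketch with synthetic per-pair errors (the error values here are made up for illustration):

```python
import torch

# RRA@t / RTA@t: fraction of image pairs whose angular error is below t degrees,
# reported as a percentage.
def accuracy_at(err_deg, threshold=30.0):
    return (err_deg < threshold).float().mean().item() * 100

rot_err   = torch.tensor([0.4, 2.1, 55.0, 1.3])   # per-pair rotation error (deg)
trans_err = torch.tensor([5.0, 40.0, 3.2, 12.0])  # per-pair translation error (deg)
print(f"RRA@30 = {accuracy_at(rot_err):.2f}")     # 3 of 4 pairs pass -> 75.00
print(f"RTA@30 = {accuracy_at(trans_err):.2f}")   # 3 of 4 pairs pass -> 75.00
```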
Point Map Estimation
The quality of reconstruction point maps was assessed across various conditions. In object-level and scene-level datasets, π³ outperformed VGGT, with notable improvements in accuracy and completeness metrics.
Video Depth Estimation
In experiments with datasets like Sintel and KITTI, π³ consistently achieved new state-of-the-art performance across all settings, demonstrating superior inference speed and lower error rates compared to prior methods.
Robustness Evaluation
Robustness to input sequence permutations was evaluated, revealing that π³ maintained near-zero standard deviation across all metrics, confirming the effectiveness of its permutation-equivariant design.
Conclusion and Future Work
Conclusion
The proposed π³ model is a fully permutation-equivariant feedforward neural network that predicts affine-invariant camera poses and scale-invariant local point maps, freeing visual geometry reconstruction from the constraints of fixed reference views. It has shown state-of-the-art performance in various tasks across multiple datasets.
Future Work
Future optimizations could address the model’s limitations, such as improving handling of transparent objects, enhancing geometric reconstruction details, and reducing artifacts in point cloud generation. Additionally, exploring its application in dynamic scenes and complex environments could further extend its practical value in areas like augmented reality and robotic navigation.
Quick Start
Getting Started
Clone the repository and install the required dependencies:
git clone https://github.com/yyfz/Pi3.git
cd Pi3
pip install -r requirements.txt
Run inference using the example scripts, which support processing image directories or video files. If the automatic model checkpoint download is slow, downloading the weights manually is suggested.
Detailed Usage Instructions
The model accepts image tensors as input and outputs a dictionary containing reconstructed geometric information. Each input tensor should have a shape of B×N×3×H×W, where B is the batch size, N is the number of images, and H and W are the height and width of the images.
Example Code Snippet
import torch
from pi3.models.pi3 import Pi3
from pi3.utils.basic import load_images_as_tensor

# Select GPU if available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the pretrained Pi3 weights from the Hugging Face Hub
model = Pi3.from_pretrained("yyfz233/Pi3").to(device).eval()

# Load images from the input path (sampling every `interval`-th frame)
# into an N x 3 x H x W tensor
imgs = load_images_as_tensor('your_data_path', interval=10).to(device)

with torch.no_grad():
    results = model(imgs[None])  # add the batch dimension: 1 x N x 3 x H x W

print("Reconstruction completed!")
Original article by NenPower. If reposted, please credit the source: https://nenpower.com/blog/breakthrough-in-visual-geometry-reconstruction-zhejiang-university-team-releases-%cf%80%c2%b3-model-outperforming-vggt/
