Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry.
We introduce Sat3DGen, a geometry-first method that addresses these challenges. It enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields substantial gains in both 3D accuracy and photorealism. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation.
Comparison of single-satellite-to-3D methods. Prior geometry-colorization methods (Sat2City) produce building-only geometry from synthetic supervision; proxy-based methods (Sat2Scene, Sat2Density++) generate holistic scenes but with coarse geometry. Sat3DGen achieves both high geometric fidelity and photorealistic rendering, enabling downstream applications including surround-view video, satellite-to-DSM, semantic-map-to-3D, and large-area mesh generation.
A frozen DINOv3 satellite encoder tokenizes the input image, spatial tokens expand the effective scene extent, a VAE-style decoder lifts tokens into a triplane 3D representation, and an MLP predicts per-point density and color. A separate sky module models illumination on the sphere.
A frozen DINOv3 ViT backbone (pretrained on 493M satellite images) extracts a 16×16×1024 token grid from the input overhead RGB.
Spatial tokens expand the footprint. A VAE-style decoder upsamples tokens into a 320×320 triplane (XY, XZ, YZ planes) covering the full city block with continuous 3D features.
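To make the triplane-plus-MLP part of the pipeline concrete, here is a minimal PyTorch sketch of how per-point features could be gathered from the three planes and decoded into density and color. The tensor shapes, 32-channel feature width, and MLP layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Bilinearly sample features for 3D points from XY/XZ/YZ feature planes.

    planes: (3, C, H, W) tensor holding the XY, XZ, YZ planes (e.g. 320x320).
    xyz:    (N, 3) query points normalized to [-1, 1].
    Returns (N, 3*C) concatenated per-point features.
    """
    coords = torch.stack([xyz[:, [0, 1]],   # project onto the XY plane
                          xyz[:, [0, 2]],   # project onto the XZ plane
                          xyz[:, [1, 2]]])  # project onto the YZ plane -> (3, N, 2)
    grid = coords.unsqueeze(2)              # (3, N, 1, 2) as expected by grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=False)
    return feats.squeeze(-1).permute(2, 0, 1).reshape(xyz.shape[0], -1)

class DensityColorMLP(torch.nn.Module):
    """Tiny MLP mapping triplane features to (density, RGB); widths are assumptions."""
    def __init__(self, in_dim=3 * 32, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4))     # 1 density + 3 color channels

    def forward(self, feats):
        out = self.net(feats)
        density = F.softplus(out[:, :1])    # non-negative density
        color = torch.sigmoid(out[:, 1:])   # RGB in [0, 1]
        return density, color
```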
Three novel losses (gravity-based density variation, a monocular relative depth prior, and panoramic-to-perspective supervision) explicitly enforce accurate geometry.
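The paper's exact loss formulations are not reproduced here, but the gravity idea can be illustrated: under gravity, solid matter rests on support, so density that increases as we move upward within a vertical column is suspicious. The sketch below is one plausible penalty of this kind; it is an assumption about the form, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def gravity_density_loss(density_grid):
    """Penalize density that increases with height within each vertical column.

    density_grid: (B, X, Y, Z) densities sampled on a regular grid, with the
    last axis ordered from ground (z = 0) upward.
    NOTE: an illustrative guess at a gravity-based variation term, not the
    paper's formula; the depth-prior and panoramic-to-perspective terms are
    applied through the same volume-rendering pipeline.
    """
    upward_increase = density_grid[..., 1:] - density_grid[..., :-1]
    return F.relu(upward_increase).mean()
```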
Marching Cubes extracts a watertight .obj mesh from the density field. Volume rendering supports satellite, panoramic, and perspective street-view cameras.
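As a sketch of the mesh-extraction step, the snippet below runs Marching Cubes on a density grid with scikit-image and wraps the result in an Open3D mesh. The iso-level, voxel size, and file path are assumptions; the released pipeline may choose them differently.

```python
import numpy as np
import open3d as o3d
from skimage import measure

def density_to_mesh(density, iso_level=10.0, voxel_size=1.0):
    """Extract a triangle mesh from an (X, Y, Z) density grid via Marching Cubes."""
    verts, faces, normals, _ = measure.marching_cubes(density, level=iso_level)
    mesh = o3d.geometry.TriangleMesh(
        o3d.utility.Vector3dVector(verts * voxel_size),
        o3d.utility.Vector3iVector(faces.astype(np.int32)))
    mesh.compute_vertex_normals()
    return mesh

# mesh = density_to_mesh(density_grid)
# o3d.io.write_triangle_mesh("scene.obj", mesh)  # per-vertex colors can be
#                                                # attached before export
```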
Drag to orbit · scroll to zoom · right-click to pan. Each tab loads its .obj (Open3D, per-vertex color) only when you first click it.
We host five hand-picked results (the named demo plus the four densest meshes by triangle count) so the page stays light. Meshes are cached in memory when you switch back.
Selection: scenes.json in assets/meshes/ (Scene 1 = curated sat_demo_6; Scenes 2–5 = the densest of the remaining meshes by triangle count). The six video tabs below are independent.
WebGL: if the canvas stays blank, try a recent desktop browser.
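If you prefer to inspect a released mesh offline rather than in the browser, a few Open3D lines suffice; the file path below is hypothetical and should be replaced with one of the files listed in scenes.json.

```python
import open3d as o3d

# Load one of the released scene meshes (path is hypothetical).
mesh = o3d.io.read_triangle_mesh("assets/meshes/sat_demo_6.obj")
mesh.compute_vertex_normals()                # needed for shaded rendering
print(f"{len(mesh.triangles)} triangles, "
      f"vertex colors: {mesh.has_vertex_colors()}")
o3d.visualization.draw_geometries([mesh])    # orbit / zoom / pan with the mouse
```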
For each scene we show the input satellite tile + walkthrough trajectory, the orbit view of the recovered 3D mesh, the panoramic rendering along the trajectory, and a 4-camera street-view sweep.
Web video is low-bitrate; for crisp motion see the paper / supplement and the released data.
Source videos are 128×512 (4:1). They are displayed in a 3:1 frame at roughly ¾ of the gallery column width (capped at 720px); object-fit: contain keeps them sharp, with letterboxing as needed.
Three ways to run Sat3DGen on your own satellite images; pick whichever matches your setup.
Hosted demo: zero install, just open the link and upload a satellite tile. Runs on CPU (slow but free).
Local web UI: clone the repo, run python app.py, then open http://localhost:7860. A GPU is recommended.
Command line: run bash inference.sh <sat_image> for the full end-to-end pipeline: mesh, panorama, and street video.
Because Sat3DGen produces a real, high-fidelity 3D asset rather than view-conditioned pixels, the same model unlocks a family of downstream tasks out-of-the-box.
Drive synthesis from a 2D semantic map of the satellite tile and obtain a consistent textured 3D scene.
Render arbitrary trajectories with multiple cameras (panorama + 4 perspective views) for downstream generative video.
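As an illustration of what "panorama + 4 perspective views" means in practice, here is a small NumPy sketch that builds camera positions and yaw angles along a straight street-level trajectory. The pose convention, eye height, and 90° camera spacing are assumptions for illustration, not the project's actual camera API.

```python
import numpy as np

def street_trajectory_cameras(start, end, n_steps=60, height=1.6):
    """Camera centers and headings for a straight street-level walkthrough.

    Returns, per step, the camera position plus four yaw angles (front, right,
    back, left) for the perspective views; the panorama shares the position.
    Conventions (z-up, yaw in radians) are illustrative assumptions.
    """
    start, end = np.asarray(start, float), np.asarray(end, float)
    ts = np.linspace(0.0, 1.0, n_steps)[:, None]
    centers = start + ts * (end - start)
    centers[:, 2] = height                          # eye level above ground
    heading = np.arctan2(*(end - start)[[1, 0]])    # direction of travel
    yaws = heading + np.array([0, -np.pi / 2, np.pi, np.pi / 2])
    return centers, yaws

# centers, yaws = street_trajectory_cameras([0, 0, 0], [50, 10, 0])
# Each (center, yaw) pair can then be turned into a camera pose and rendered.
```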
Slice a large satellite raster into tiles, run Sat3DGen tile-wise, and stitch the resulting meshes into a city-block-scale 3D model.
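A minimal sketch of the stitching half of this idea, assuming each tile's mesh is produced in its own local coordinate frame and only needs translating into a shared frame. The tile size, paths, and grid layout are hypothetical.

```python
import numpy as np
import open3d as o3d

def stitch_city_block(tile_meshes, tile_origins):
    """Merge per-tile meshes into one large mesh.

    tile_meshes:  list of o3d.geometry.TriangleMesh, one per satellite tile,
                  each in its own local (tile-centered) coordinates.
    tile_origins: list of (x, y, z) offsets placing each tile in a shared frame.
    Assumes neighboring tiles were cropped so their footprints line up.
    """
    merged = o3d.geometry.TriangleMesh()
    for mesh, origin in zip(tile_meshes, tile_origins):
        merged += o3d.geometry.TriangleMesh(mesh).translate(np.asarray(origin, float))
    return merged

# tiles   = [o3d.io.read_triangle_mesh(p) for p in tile_paths]        # hypothetical paths
# origins = [(col * 320.0, row * 320.0, 0.0) for row, col in tile_grid]  # hypothetical layout
# o3d.io.write_triangle_mesh("city_block.obj", stitch_city_block(tiles, origins))
```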
Per-pixel height can be read directly from the recovered 3D geometry, yielding a DSM from a single satellite input at test time. We do not use DSM ground truth as supervision during training (no height-map loss); geometry is shaped solely by the losses described in the paper.
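One simple way to read a height map out of a density grid is to treat each pixel as a nadir-looking ray and take the transmittance-weighted expected height, as in standard volume rendering. The sketch below is an illustrative read-out under that assumption, not necessarily the evaluation protocol used in the paper.

```python
import torch

def density_to_dsm(density, z_values, delta=1.0):
    """Per-pixel surface height from a density grid via top-down volume rendering.

    density:  (X, Y, Z) non-negative densities; z_values[0] is the highest
              sample and the ray marches downward, like a nadir view.
    z_values: (Z,) sample heights, descending.
    delta:    spacing between samples along the ray.
    Returns an (X, Y) expected-height map (an illustrative DSM read-out).
    """
    alpha = 1.0 - torch.exp(-density * delta)                    # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = trans * alpha                                      # ray termination probability
    dsm = (weights * z_values).sum(dim=-1) / (weights.sum(dim=-1) + 1e-8)
    return dsm
```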
If our work helps your research, please cite:
@inproceedings{qian2026sat3dgen,
  title     = {Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image},
  author    = {Ming Qian and Zimin Xia and Changkun Liu and Shuailei Ma and Wen Wang and Zeran Ke and Bin Tan and Hang Zhang and Gui-Song Xia},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=E7JzkZCofa}
}
@article{Qian_2026_Sat2Densitypp,
  author  = {Qian, Ming and Tan, Bin and Wang, Qiuyu and Zheng, Xianwei and Xiong, Hanjiang and Xia, Gui-Song and Shen, Yujun and Xue, Nan},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title   = {Seeing Through Satellite Images at Street Views},
  year    = {2026},
  volume  = {48},
  number  = {5},
  pages   = {5692--5709},
  doi     = {10.1109/TPAMI.2026.3652860}
}
@inproceedings{Qian_2023_Sat2Density,
  author    = {Qian, Ming and Xiong, Jincheng and Xia, Gui-Song and Xue, Nan},
  title     = {Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {3683--3692}
}
This work builds on a number of excellent open-source projects, including Sat2Density, Sat2Density++, EG3D, DINOv3, PyTorch, and Diffusers. We also thank our collaborators and colleagues for their valuable feedback.
Page template inspired by PLANA3R.