ICLR 2026

Sat3DGen:
Comprehensive Street-Level 3D Scene Generation
from a Single Satellite Image

1Wuhan University 2EPFL 3HKUST 4Northeastern University 5Zhejiang University 6Ant Group 7AMap, Alibaba Group

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry.

We introduce Sat3DGen to address these challenges with a geometry-first methodology. It enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric design yields a substantial improvement in both 3D accuracy and photorealism. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation.

Paper at a Glance

Sat3DGen teaser: comparison with prior methods and downstream applications

Comparison of single-satellite-to-3D methods. Prior geometry-colorization methods (Sat2City) produce building-only geometry from synthetic supervision; proxy-based methods (Sat2Scene, Sat2Density++) generate holistic scenes but with coarse geometry. Sat3DGen achieves both high geometric fidelity and photorealistic rendering, enabling downstream applications including surround-view video, satellite-to-DSM, semantic-map-to-3D, and large-area mesh generation.

Method Overview

Sat3DGen pipeline diagram (Figure 2)

A frozen DINOv3 satellite encoder tokenizes the input image; spatial tokens expand the effective scene extent; a VAE-style decoder lifts the tokens into a triplane 3D representation; and an MLP predicts per-point density and color. A separate sky module models illumination on the sphere.

1. Satellite Encoding

A frozen DINOv3 ViT backbone (pretrained on 493M satellite images) extracts a 16×16×1024 token grid from the input overhead RGB.
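The shape bookkeeping is simple; below is a minimal sketch, assuming the frozen backbone already returns flat patch tokens (the DINOv3 loading call itself is omitted, and the dummy tensor only stands in for its output).

import torch

def to_token_grid(patch_tokens: torch.Tensor) -> torch.Tensor:
    # Reshape flat ViT patch tokens (B, N, C) into a spatial grid (B, H, W, C).
    # For the satellite encoder above, N = 256 and C = 1024, giving the
    # 16x16x1024 grid consumed by the triplane decoder.
    B, N, C = patch_tokens.shape
    side = int(N ** 0.5)
    assert side * side == N, "expects a square patch grid"
    return patch_tokens.reshape(B, side, side, C)

# Stand-in for the frozen DINOv3 output (CLS token removed, patch tokens kept).
dummy_tokens = torch.randn(1, 256, 1024)
print(to_token_grid(dummy_tokens).shape)  # torch.Size([1, 16, 16, 1024])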

2. Triplane Lifting

Spatial tokens expand the footprint. A VAE-style decoder upsamples tokens into a 320×320 triplane (XY, XZ, YZ planes) covering the full city block with continuous 3D features.
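To make the triplane concrete, here is a minimal EG3D-style feature query, assuming the three axis-aligned planes are bilinearly sampled and averaged before the density/color MLP; the feature width and the averaging rule are assumptions of this sketch, not the released implementation.

import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    # planes: (B, 3, C, H, W) features for the XY, XZ, YZ planes.
    # points: (B, N, 3) query coordinates normalized to [-1, 1].
    B, _, C, H, W = planes.shape
    x, y, z = points.unbind(-1)
    coords = torch.stack([
        torch.stack([x, y], -1),   # projection onto the XY plane
        torch.stack([x, z], -1),   # projection onto the XZ plane
        torch.stack([y, z], -1),   # projection onto the YZ plane
    ], dim=1).reshape(B * 3, 1, -1, 2)
    feats = F.grid_sample(planes.reshape(B * 3, C, H, W), coords,
                          mode='bilinear', align_corners=False)   # (B*3, C, 1, N)
    feats = feats.reshape(B, 3, C, -1).permute(0, 3, 1, 2)        # (B, N, 3, C)
    return feats.mean(dim=2)       # aggregate the three plane features

planes = torch.randn(1, 3, 32, 320, 320)      # 320x320 triplane as described above
points = torch.rand(1, 4096, 3) * 2 - 1
print(sample_triplane(planes, points).shape)  # torch.Size([1, 4096, 32])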

3. Geometry-First Training

Three novel losses (gravity-based density variation, a monocular relative-depth prior, and panoramic-to-perspective supervision) explicitly enforce accurate geometry.
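The exact formulations are given in the paper; as one illustration, a gravity-based density regularizer can be sketched as below, under the assumption that density sampled bottom-to-top along a vertical column should not increase with height (no mass floating above empty space). The sampling scheme and weighting here are illustrative only.

import torch

def gravity_density_loss(density_columns: torch.Tensor) -> torch.Tensor:
    # density_columns: (B, N_columns, N_heights) densities sampled bottom-to-top
    # along gravity-aligned rays. Penalizes any increase of density with height.
    d_below = density_columns[..., :-1]
    d_above = density_columns[..., 1:]
    return torch.relu(d_above - d_below).mean()

# A column whose density grows with height (floating mass) is penalized,
# while a ground-supported column is not.
floating = torch.tensor([[[0.0, 0.0, 1.0, 2.0]]])
grounded = torch.tensor([[[2.0, 1.0, 0.0, 0.0]]])
print(gravity_density_loss(floating), gravity_density_loss(grounded))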

4. Mesh & Render

Marching Cubes extracts a watertight .obj mesh from the density field. Volume rendering supports satellite, panoramic, and perspective street-view cameras.
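A minimal sketch of the extraction step, assuming the density field has been sampled onto a regular grid; the grid resolution, iso-level, and the use of scikit-image plus trimesh are choices of this sketch rather than the release code.

import numpy as np
import trimesh
from skimage import measure

def density_grid_to_obj(density: np.ndarray, iso_level: float,
                        voxel_size: float, out_path: str) -> None:
    # density: (X, Y, Z) grid of volume densities queried from the triplane MLP.
    verts, faces, normals, _ = measure.marching_cubes(density, level=iso_level)
    verts = verts * voxel_size                      # voxel indices -> metric coordinates
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
    mesh.export(out_path)                           # writes a standard .obj file

# Toy example: a solid sphere of density inside a 64^3 grid.
coords = np.mgrid[:64, :64, :64] - 32
density = (20.0 - np.linalg.norm(coords, axis=0)).clip(min=0)
density_grid_to_obj(density, iso_level=0.5, voxel_size=0.5, out_path="scene.obj")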

Interactive 3D Mesh

Drag to orbit · scroll to zoom · right-click to pan. Each tab loads a .obj (Open3D, per-vertex color) only when you first click it. We host five hand-picked results (the named demo plus the four densest meshes by triangle count) so the page stays light. Meshes are cached in memory when you switch back to a tab.

Selection: scenes.json in assets/meshes/ (Scene 1 = curated sat_demo_6; Scenes 2–5 = the densest of the remaining meshes by triangle count). The six video tabs below are independent. WebGL: if the canvas stays blank, use a recent desktop browser.
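To inspect a downloaded mesh locally instead of in the browser, a minimal Open3D snippet (the file path is a placeholder) is:

import open3d as o3d

# Load one of the hosted meshes; per-vertex colors are picked up when the
# .obj stores them, and normals are needed for shaded rendering.
mesh = o3d.io.read_triangle_mesh("assets/meshes/sat_demo_6.obj")
mesh.compute_vertex_normals()
o3d.visualization.draw_geometries([mesh])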

Try It Yourself

Three ways to run Sat3DGen on your own satellite images; pick whichever matches your setup.

Applications

Because Sat3DGen produces a real, high-fidelity 3D asset rather than view-conditioned pixels, the same model unlocks a family of downstream tasks out of the box.

🗺️ Semantic-Map → 3D

Drive synthesis from a 2D semantic map of the satellite tile and obtain a consistent textured 3D scene.

🎬

Multi-Camera Video

Render arbitrary trajectories with multiple cameras (panorama + 4 perspective views) for downstream generative video.
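A minimal sketch of laying out such a rig along a trajectory, assuming the renderer accepts 4x4 camera-to-world matrices and a z-up world; the heading convention, eye height, and spacing are assumptions of this sketch.

import numpy as np

def heading_to_c2w(position: np.ndarray, yaw_deg: float) -> np.ndarray:
    # Camera-to-world matrix for a level camera at `position` facing `yaw_deg`
    # (0 degrees = +x, measured counter-clockwise about the z-up axis).
    yaw = np.deg2rad(yaw_deg)
    forward = np.array([np.cos(yaw), np.sin(yaw), 0.0])
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2] = right, -up, forward   # x-right, y-down, z-forward
    c2w[:3, 3] = position
    return c2w

# One panorama center plus four perspective headings per trajectory point.
trajectory = [np.array([x, 0.0, 1.6]) for x in np.linspace(0.0, 20.0, 5)]   # 1.6 m eye height
rig = {i: [heading_to_c2w(p, yaw) for yaw in (0, 90, 180, 270)]
       for i, p in enumerate(trajectory)}
print(len(rig), "trajectory points,", len(rig[0]), "perspective cameras each")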

πŸ™οΈ

Large-Scale Meshing

Slice a large satellite raster into tiles, run Sat3DGen tile-wise, and stitch the resulting meshes into a city-block-scale 3D model.
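A minimal sketch of that tile-and-stitch loop, assuming a hypothetical run_sat3dgen(tile) callable that returns a trimesh.Trimesh in tile-local meters; the tile size, overlap handling, and ground sampling distance are assumptions of this sketch.

import numpy as np
import trimesh

def mesh_large_raster(raster: np.ndarray, tile_px: int, meters_per_px: float, run_sat3dgen):
    # raster: (H, W, 3) satellite image; run_sat3dgen maps one tile to a mesh.
    H, W, _ = raster.shape
    tile_meshes = []
    for row in range(0, H - tile_px + 1, tile_px):
        for col in range(0, W - tile_px + 1, tile_px):
            tile = raster[row:row + tile_px, col:col + tile_px]
            mesh = run_sat3dgen(tile)
            # Shift each tile's mesh to its position in the mosaic (x = east, y = north).
            offset = np.array([col * meters_per_px, (H - row - tile_px) * meters_per_px, 0.0])
            mesh.apply_translation(offset)
            tile_meshes.append(mesh)
    return trimesh.util.concatenate(tile_meshes)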

📏 Unsupervised DSM

Heights can be read directly from the recovered 3D geometry, yielding a per-pixel DSM from a single satellite image at test time. We do not use DSM ground truth as supervision during training (there is no height-map loss); the geometry is shaped entirely by the losses described in the paper.
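A minimal sketch of reading the DSM out of the recovered geometry, under the assumption that density is sampled on a regular (X, Y, Z) grid and the highest sufficiently dense voxel in each column defines the surface; the occupancy threshold and vertical resolution are assumptions of this sketch.

import numpy as np

def density_to_dsm(density: np.ndarray, z_min: float, voxel_size: float,
                   threshold: float = 0.5) -> np.ndarray:
    # density: (X, Y, Z) grid from the triplane, Z indexed bottom-to-top.
    # Returns an (X, Y) height map in meters (the DSM).
    occupied = density > threshold
    # Index of the highest occupied voxel in each vertical column.
    top_idx = occupied.shape[2] - 1 - np.argmax(occupied[..., ::-1], axis=2)
    has_surface = occupied.any(axis=2)
    dsm = z_min + top_idx.astype(np.float64) * voxel_size
    dsm[~has_surface] = z_min            # empty columns fall back to ground level
    return dsm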

Citation

If our work helps your research, please cite:

@inproceedings{qian2026sat3dgen,
  title     = {Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image},
  author    = {Ming Qian and Zimin Xia and Changkun Liu and Shuailei Ma
               and Wen Wang and Zeran Ke and Bin Tan and Hang Zhang and Gui-Song Xia},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=E7JzkZCofa}
}
@article{Qian_2026_Sat2Densitypp,
  author  = {Qian, Ming and Tan, Bin and Wang, Qiuyu and Zheng, Xianwei and Xiong, Hanjiang and Xia, Gui-Song and Shen, Yujun and Xue, Nan},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title   = {Seeing Through Satellite Images at Street Views},
  year    = {2026},
  volume  = {48},
  number  = {5},
  pages   = {5692--5709},
  doi     = {10.1109/TPAMI.2026.3652860}
}
@inproceedings{Qian_2023_Sat2Density,
  author    = {Qian, Ming and Xiong, Jincheng and Xia, Gui-Song and Xue, Nan},
  title     = {Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {3683--3692}
}

Acknowledgements

This work builds on a number of excellent open-source projects, including Sat2Density, Sat2Density++, EG3D, DINOv3, PyTorch, and Diffusers. We also thank our collaborators and colleagues for their valuable feedback.

Page template inspired by PLANA3R.