Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry.
We introduce Sat3DGen, a geometry-first method that addresses these challenges. It enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields substantial gains in both 3D accuracy and photorealism. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation.
Comparison of single-satellite-to-3D methods. Prior geometry-colorization methods (Sat2City) produce building-only geometry from synthetic supervision; proxy-based methods (Sat2Scene, Sat2Density++) generate holistic scenes but with coarse geometry. Sat3DGen achieves both high geometric fidelity and photorealistic rendering, enabling downstream applications including surround-view video, satellite-to-DSM, semantic-map-to-3D, and large-area mesh generation.
A frozen DINOv3 satellite encoder tokenizes the input image, spatial tokens expand the effective scene extent, a VAE-style decoder lifts tokens into a triplane 3D representation, and an MLP predicts per-point density and color. A separate sky module models illumination on the sphere.
A frozen DINOv3 ViT backbone (pretrained on 493M satellite images) extracts a 16×16×1024 token grid from the input overhead RGB.
Spatial tokens expand the footprint. A VAE-style decoder upsamples tokens into a 320×320 triplane (XY, XZ, YZ planes) covering the full city block with continuous 3D features.
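To make the triplane-plus-MLP part of the pipeline concrete, here is a minimal PyTorch sketch of how per-point features could be gathered from the three planes and decoded into density and color. The tensor shapes, 32-channel feature width, and MLP layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Bilinearly sample features for 3D points from XY/XZ/YZ feature planes.

    planes: (3, C, H, W) tensor holding the XY, XZ, YZ planes (e.g. 320x320).
    xyz:    (N, 3) query points normalized to [-1, 1].
    Returns (N, 3*C) concatenated per-point features.
    """
    coords = torch.stack([xyz[:, [0, 1]],   # project onto the XY plane
                          xyz[:, [0, 2]],   # project onto the XZ plane
                          xyz[:, [1, 2]]])  # project onto the YZ plane -> (3, N, 2)
    grid = coords.unsqueeze(2)              # (3, N, 1, 2) as expected by grid_sample
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=False)
    return feats.squeeze(-1).permute(2, 0, 1).reshape(xyz.shape[0], -1)

class DensityColorMLP(torch.nn.Module):
    """Tiny MLP mapping triplane features to (density, RGB); widths are assumptions."""
    def __init__(self, in_dim=3 * 32, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4))     # 1 density + 3 color channels

    def forward(self, feats):
        out = self.net(feats)
        density = F.softplus(out[:, :1])    # non-negative density
        color = torch.sigmoid(out[:, 1:])   # RGB in [0, 1]
        return density, color
```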
Three novel losses (gravity-based density variation, a monocular relative depth prior, and panoramic-to-perspective supervision) explicitly enforce accurate geometry.
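The paper's exact loss formulations are not reproduced here, but the gravity idea can be illustrated: under gravity, solid matter rests on support, so density that increases as we move upward within a vertical column is suspicious. The sketch below is one plausible penalty of this kind; it is an assumption about the form, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def gravity_density_loss(density_grid):
    """Penalize density that increases with height within each vertical column.

    density_grid: (B, X, Y, Z) densities sampled on a regular grid, with the
    last axis ordered from ground (z = 0) upward.
    NOTE: an illustrative guess at a gravity-based variation term, not the
    paper's formula; the depth-prior and panoramic-to-perspective terms are
    applied through the same volume-rendering pipeline.
    """
    upward_increase = density_grid[..., 1:] - density_grid[..., :-1]
    return F.relu(upward_increase).mean()
```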
Marching Cubes extracts a watertight .obj mesh from the density field. Volume rendering supports satellite, panoramic, and perspective street-view cameras.
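As a sketch of the mesh-extraction step, the snippet below runs Marching Cubes on a density grid with scikit-image and wraps the result in an Open3D mesh. The iso-level, voxel size, and file path are assumptions; the released pipeline may choose them differently.

```python
import numpy as np
import open3d as o3d
from skimage import measure

def density_to_mesh(density, iso_level=10.0, voxel_size=1.0):
    """Extract a triangle mesh from an (X, Y, Z) density grid via Marching Cubes."""
    verts, faces, normals, _ = measure.marching_cubes(density, level=iso_level)
    mesh = o3d.geometry.TriangleMesh(
        o3d.utility.Vector3dVector(verts * voxel_size),
        o3d.utility.Vector3iVector(faces.astype(np.int32)))
    mesh.compute_vertex_normals()
    return mesh

# mesh = density_to_mesh(density_grid)
# o3d.io.write_triangle_mesh("scene.obj", mesh)  # per-vertex colors can be
#                                                # attached before export
```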
Drag to orbit · scroll to zoom · right-click to pan. Each tab loads its .obj (Open3D, per-vertex color) only when you first click it.
We host five hand-picked results (the named demo plus the four densest meshes by triangle count) so the page stays light. Meshes are cached in memory when you switch back.
Selection: scenes.json in assets/meshes/ (Scene 1 = curated sat_demo_6; Scenes 2–5 = the densest of the remaining meshes by triangle count). The six video tabs below are independent.
WebGL: if the canvas stays blank, try a recent desktop browser.
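If you prefer to inspect a released mesh offline rather than in the browser, a few Open3D lines suffice; the file path below is hypothetical and should be replaced with one of the files listed in scenes.json.

```python
import open3d as o3d

# Load one of the released scene meshes (path is hypothetical).
mesh = o3d.io.read_triangle_mesh("assets/meshes/sat_demo_6.obj")
mesh.compute_vertex_normals()                # needed for shaded rendering
print(f"{len(mesh.triangles)} triangles, "
      f"vertex colors: {mesh.has_vertex_colors()}")
o3d.visualization.draw_geometries([mesh])    # orbit / zoom / pan with the mouse
```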
For each scene we show the input satellite tile + walkthrough trajectory, the orbit view of the recovered 3D mesh, the panoramic rendering along the trajectory, and a 4-camera street-view sweep.
Web video is low-bitrate; for crisp motion see the paper / supplement and the released data.
Source videos are 128×512 (4:1). They are displayed in a 3:1 frame at roughly ¾ of the gallery column width (capped at 720px); object-fit: contain keeps them sharp, with letterboxing as needed.
Three ways to run Sat3DGen on your own satellite images; pick whichever matches your setup.
Hosted demo: zero install, just open the link and upload a satellite tile. Runs on CPU (slow but free).
Local web UI: clone the repo, run python app.py, then open http://localhost:7860. A GPU is recommended.
Command line: run bash inference.sh <sat_image> for the full end-to-end pipeline: mesh, panorama, and street video.
Because Sat3DGen produces a real, high-fidelity 3D asset rather than view-conditioned pixels, the same model unlocks a family of downstream tasks out-of-the-box.
Drive synthesis from a 2D semantic map of the satellite tile and obtain a consistent textured 3D scene.
Render arbitrary trajectories with multiple cameras (panorama + 4 perspective views) for downstream generative video.
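As an illustration of what "panorama + 4 perspective views" means in practice, here is a small NumPy sketch that builds camera positions and yaw angles along a straight street-level trajectory. The pose convention, eye height, and 90° camera spacing are assumptions for illustration, not the project's actual camera API.

```python
import numpy as np

def street_trajectory_cameras(start, end, n_steps=60, height=1.6):
    """Camera centers and headings for a straight street-level walkthrough.

    Returns, per step, the camera position plus four yaw angles (front, right,
    back, left) for the perspective views; the panorama shares the position.
    Conventions (z-up, yaw in radians) are illustrative assumptions.
    """
    start, end = np.asarray(start, float), np.asarray(end, float)
    ts = np.linspace(0.0, 1.0, n_steps)[:, None]
    centers = start + ts * (end - start)
    centers[:, 2] = height                          # eye level above ground
    heading = np.arctan2(*(end - start)[[1, 0]])    # direction of travel
    yaws = heading + np.array([0, -np.pi / 2, np.pi, np.pi / 2])
    return centers, yaws

# centers, yaws = street_trajectory_cameras([0, 0, 0], [50, 10, 0])
# Each (center, yaw) pair can then be turned into a camera pose and rendered.
```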
Slice a large satellite raster into tiles, run Sat3DGen tile-wise, and stitch the resulting meshes into a city-block-scale 3D model.
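A minimal sketch of the stitching half of this idea, assuming each tile's mesh is produced in its own local coordinate frame and only needs translating into a shared frame. The tile size, paths, and grid layout are hypothetical.

```python
import numpy as np
import open3d as o3d

def stitch_city_block(tile_meshes, tile_origins):
    """Merge per-tile meshes into one large mesh.

    tile_meshes:  list of o3d.geometry.TriangleMesh, one per satellite tile,
                  each in its own local (tile-centered) coordinates.
    tile_origins: list of (x, y, z) offsets placing each tile in a shared frame.
    Assumes neighboring tiles were cropped so their footprints line up.
    """
    merged = o3d.geometry.TriangleMesh()
    for mesh, origin in zip(tile_meshes, tile_origins):
        merged += o3d.geometry.TriangleMesh(mesh).translate(np.asarray(origin, float))
    return merged

# tiles   = [o3d.io.read_triangle_mesh(p) for p in tile_paths]        # hypothetical paths
# origins = [(col * 320.0, row * 320.0, 0.0) for row, col in tile_grid]  # hypothetical layout
# o3d.io.write_triangle_mesh("city_block.obj", stitch_city_block(tiles, origins))
```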
Per-pixel height can be read directly from the recovered 3D geometry, yielding a DSM from a single satellite input at test time. We do not use DSM ground truth as supervision during training (no height-map loss); geometry is shaped solely by the losses described in the paper.
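One simple way to read a height map out of a density grid is to treat each pixel as a nadir-looking ray and take the transmittance-weighted expected height, as in standard volume rendering. The sketch below is an illustrative read-out under that assumption, not necessarily the evaluation protocol used in the paper.

```python
import torch

def density_to_dsm(density, z_values, delta=1.0):
    """Per-pixel surface height from a density grid via top-down volume rendering.

    density:  (X, Y, Z) non-negative densities; z_values[0] is the highest
              sample and the ray marches downward, like a nadir view.
    z_values: (Z,) sample heights, descending.
    delta:    spacing between samples along the ray.
    Returns an (X, Y) expected-height map (an illustrative DSM read-out).
    """
    alpha = 1.0 - torch.exp(-density * delta)                    # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = trans * alpha                                      # ray termination probability
    dsm = (weights * z_values).sum(dim=-1) / (weights.sum(dim=-1) + 1e-8)
    return dsm
```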
If our work helps your research, please cite:
@inproceedings{qian2026sat3dgen,
  title     = {Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image},
  author    = {Ming Qian and Zimin Xia and Changkun Liu and Shuailei Ma and Wen Wang and Zeran Ke and Bin Tan and Hang Zhang and Gui-Song Xia},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=E7JzkZCofa}
}
@article{Qian_2026_Sat2Densitypp,
  author  = {Qian, Ming and Tan, Bin and Wang, Qiuyu and Zheng, Xianwei and Xiong, Hanjiang and Xia, Gui-Song and Shen, Yujun and Xue, Nan},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title   = {Seeing Through Satellite Images at Street Views},
  year    = {2026},
  volume  = {48},
  number  = {5},
  pages   = {5692--5709},
  doi     = {10.1109/TPAMI.2026.3652860}
}
@inproceedings{Qian_2023_Sat2Density,
  author    = {Qian, Ming and Xiong, Jincheng and Xia, Gui-Song and Xue, Nan},
  title     = {Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {3683--3692}
}
This work builds on a number of excellent open-source projects, including Sat2Density, Sat2Density++, EG3D, DINOv3, PyTorch, and Diffusers. We also thank our collaborators and colleagues for their valuable feedback.
Page template inspired by PLANA3R.