Photo3D pursues photorealistic 3D generation by enhancing appearance detail while preserving structural consistency, with dedicated training strategies designed for different 3D‑native paradigms.
We build a 3D‑aligned multi‑view synthesis pipeline and construct a realism‑enhanced dataset, Photo3D‑MV, to support training of photorealistic 3D‑native models.
Experiments show that Photo3D achieves state‑of‑the‑art photorealistic 3D generation across multiple paradigms and benchmarks.
Photo3D introduces Photo3D‑MV, a detail‑enhanced multi‑view dataset with aligned 3D geometry, together with training strategies for learning realistic 3D appearance under both geometry‑texture coupled and decoupled 3D‑native paradigms. This framework enables photorealistic and structure‑consistent 3D generation.
Photo3D‑MV is constructed through a structure‑aligned multi‑view synthesis pipeline. We convert text prompts into object‑centric descriptions, generate images with Flux.1‑Dev, reconstruct 3D assets using Trellis, and refine multi‑view renderings with GPT‑4o‑Image. The resulting photorealistic views, text descriptions, and 3D assets together form the Photo3D‑MV dataset.
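For concreteness, the Python sketch below mirrors these four stages. Every function in it (to_object_centric, generate_image, reconstruct_3d, render_views, refine_view) is a hypothetical placeholder: the actual model interfaces are not specified here, so the stubs only mark where Flux.1‑Dev, Trellis, and GPT‑4o‑Image would be invoked in a real implementation.

```python
# A minimal sketch of the Photo3D-MV synthesis pipeline, assuming one
# hypothetical wrapper function per stage. None of these wrappers are real
# APIs; they stand in for calls to Flux.1-Dev, Trellis, and GPT-4o-Image.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Photo3DMVSample:
    """One dataset entry: text, four refined views, and the aligned 3D asset."""
    description: str
    views: List[bytes] = field(default_factory=list)  # four photorealistic renderings
    asset: bytes = b""                                # serialized 3D asset from Trellis


def to_object_centric(prompt: str) -> str:
    # Stage 1 (hypothetical): rewrite a free-form prompt as an
    # object-centric description suited to single-asset generation.
    return f"a single {prompt}, centered, neutral background"


def generate_image(description: str) -> bytes:
    # Stage 2 (hypothetical): text-to-image generation with Flux.1-Dev.
    return b"<image>"


def reconstruct_3d(image: bytes) -> bytes:
    # Stage 3 (hypothetical): image-to-3D reconstruction with Trellis.
    return b"<3d-asset>"


def render_views(asset: bytes, n_views: int = 4) -> List[bytes]:
    # Render n_views multi-view images of the reconstructed asset.
    return [b"<render>"] * n_views


def refine_view(render: bytes) -> bytes:
    # Stage 4 (hypothetical): detail enhancement of one rendering with
    # GPT-4o-Image, keeping the result structurally aligned with the asset.
    return b"<refined-render>"


def build_sample(prompt: str) -> Photo3DMVSample:
    description = to_object_centric(prompt)
    image = generate_image(description)
    asset = reconstruct_3d(image)
    views = [refine_view(r) for r in render_views(asset)]
    return Photo3DMVSample(description=description, views=views, asset=asset)


if __name__ == "__main__":
    sample = build_sample("weathered leather armchair")
    print(sample.description, len(sample.views))
```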
Photo3D‑MV contains 10k objects across 373 categories. Each object includes four realistic views and the corresponding 3D asset.
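To make the per‑object structure concrete, here is a minimal loader sketch. The directory layout it assumes (one folder per object containing description.txt, view_0.png through view_3.png, and asset.glb) is an illustration only; the released format may differ.

```python
# Hypothetical on-disk layout for Photo3D-MV (the actual release format
# is not specified here):
#
#   photo3d_mv/<category>/<object_id>/
#       description.txt              # object-centric text description
#       view_0.png ... view_3.png    # four photorealistic views
#       asset.glb                    # aligned 3D asset

from pathlib import Path


def iter_photo3d_mv(root: str):
    """Yield (description, view_paths, asset_path) per object under the assumed layout."""
    for obj_dir in sorted(Path(root).glob("*/*")):
        if not obj_dir.is_dir():
            continue
        desc_file = obj_dir / "description.txt"
        asset = obj_dir / "asset.glb"
        views = sorted(obj_dir.glob("view_*.png"))
        if desc_file.exists() and asset.exists() and len(views) == 4:
            yield desc_file.read_text().strip(), views, asset
```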
@misc{liang2025photo3dadvancingphotorealistic3d,
  title={Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement},
  author={Xinyue Liang and Zhiyuan Ma and Lingchen Sun and Yanjun Guo and Lei Zhang},
  year={2025},
  eprint={2512.08535},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.08535},
}