VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
Abstract
VIST3A combines latent text-to-video models and 3D reconstruction systems to generate high-quality 3D scenes from text, improving upon prior methods.
The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
Community
We introduce VIST3A, which replaces the VAE decoder with 3D foundational models such as AnySplat and VGGT, making LDM generate 3D representations. The generative model is aligned with the replaced decoder, making LDM more reliable.
Webpage: https://gohyojun15.github.io/VIST3A/
Code (under construction): https://github.com/gohyojun15/VIST3A
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FlashWorld: High-quality 3D Scene Generation within Seconds (2025)
- ShapeGen4D: Towards High Quality 4D Shape Generation from Videos (2025)
- Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation (2025)
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction (2025)
- Tinker: Diffusion's Gift to 3D-Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization (2025)
- UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation (2025)
- WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
 You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: 
@librarian-bot
	 recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper
