---
library_name: diffusers
pipeline_tag: image-to-image
---

# LDM3D-VR model

The LDM3D-VR model was proposed in ["LDM3D-VR: Latent Diffusion Model for 3D VR"](https://arxiv.org/pdf/2311.03226.pdf) by Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, and Vasudev Lal.

LDM3D-VR was accepted to the [NeurIPS Workshop'23 on Diffusion Models](https://neurips.cc/virtual/2023/workshop/66539).

This checkpoint is the upscaler from the LDM3D-VR suite, called ldm3d-sr.

## Model description
The abstract from the paper is as follows: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.

![LDM3D-SR overview](ldm3d-sr-overview.png)
<font size="2">LDM3D-SR overview </font>


## Examples

The example below uses the [🤗 Diffusers library](https://github.com/huggingface/diffusers) to generate an RGB image and a depth map with ldm3d-4c, then upscales both with ldm3d-sr.

```python
from PIL import Image
from diffusers import StableDiffusionLDM3DPipeline, StableDiffusionUpscaleLDM3DPipeline


# Generate an RGB image and a depth map with LDM3D
pipe_ldm3d = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-4c")
pipe_ldm3d.to("cuda")

prompt = "A picture of some lemons on a table"
output = pipe_ldm3d(prompt)
rgb_image, depth_image = output.rgb, output.depth
rgb_image[0].save("lemons_ldm3d_rgb.jpg")
depth_image[0].save("lemons_ldm3d_depth.png")


# Upscale the previous output to a resolution of (1024, 1024)
pipe_ldm3d_upscale = StableDiffusionUpscaleLDM3DPipeline.from_pretrained("Intel/ldm3d-sr")
pipe_ldm3d_upscale.to("cuda")

# Reload the saved pair: RGB as a 3-channel image, depth as single-channel grayscale
low_res_img = Image.open("lemons_ldm3d_rgb.jpg").convert("RGB")
low_res_depth = Image.open("lemons_ldm3d_depth.png").convert("L")
outputs = pipe_ldm3d_upscale(
    prompt="high quality high resolution uhd 4k image",
    rgb=low_res_img,
    depth=low_res_depth,
    num_inference_steps=50,
    target_res=[1024, 1024],
)

# Each output field is a list; take the first (and only) result
upscaled_rgb, upscaled_depth = outputs.rgb[0], outputs.depth[0]
upscaled_rgb.save("upscaled_lemons_rgb.png")
upscaled_depth.save("upscaled_lemons_depth.png")
```
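Because the upscaler takes an RGB image and a depth map as a pair, it can be useful to check that the two inputs have the same size before calling the pipeline. The sketch below is a hypothetical helper, not part of Diffusers: the function name `check_upscale_inputs` is invented for illustration, and it assumes `target_res` is ordered as (width, height), which makes no difference for a square target such as `[1024, 1024]`.

```python
from typing import Sequence, Tuple


def check_upscale_inputs(
    rgb_size: Tuple[int, int],
    depth_size: Tuple[int, int],
    target_res: Sequence[int],
) -> Tuple[float, float]:
    """Validate a paired RGB/depth input and return the (width, height) scale factors.

    Sizes use PIL's (width, height) convention, i.e. the value of `Image.size`.
    Treating `target_res` as (width, height) is an assumption of this sketch.
    """
    if tuple(rgb_size) != tuple(depth_size):
        raise ValueError(f"RGB size {rgb_size} and depth size {depth_size} differ")
    if len(target_res) != 2 or min(target_res) <= 0:
        raise ValueError(f"target_res must be two positive integers, got {target_res}")

    width, height = rgb_size
    if width <= 0 or height <= 0:
        raise ValueError(f"Input size must be positive, got {rgb_size}")

    # Ratio between the requested resolution and the input resolution per axis
    return target_res[0] / width, target_res[1] / height
```

With the variables from the example above, this would be called as `check_upscale_inputs(low_res_img.size, low_res_depth.size, [1024, 1024])` before invoking `pipe_ldm3d_upscale`.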


## Results

Output of ldm3d-4c          |  Upscaled output
:-------------------------:|:-------------------------:
![ldm3d_rgb_results](lemons_ldm3d_rgb.jpg)  |  ![ldm3d_sr_rgb_results](upscaled_lemons_rgb.png)
![ldm3d_depth_results](lemons_ldm3d_depth.png)  |  ![ldm3d_sr_depth_results](upscaled_lemons_depth.png)



### BibTeX entry and citation info
@misc{stan2023ldm3dvr,
      title={LDM3D-VR: Latent Diffusion Model for 3D VR}, 
      author={Gabriela Ben Melech Stan and Diana Wofk and Estelle Aflalo and Shao-Yen Tseng and Zhipeng Cai and Michael Paulitsch and Vasudev Lal},
      year={2023},
      eprint={2311.03226},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}