amankishore committed · Commit c255c40 · Parent: 5426b53

Added subpixel rendering!

Files changed:
- README-orig.md (+27 −22)
- README.md (+9 −1)
- app.py (+7 −3)
- highres_final_vis.py (+124 −0)
- voxnerf/vox.py (+0 −3)
README-orig.md  CHANGED

@@ -9,26 +9,35 @@

 TTI-Chicago, †Purdue University

-> We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as learned predictors of a gradient field, often referred to as the score function of the data log-likelihood. We apply the chain rule on the estimated score, hence the name Score Jacobian Chaining (SJC).
+Abstract: *A diffusion model learns to predict a vector field of gradients. We propose to apply chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.*

 <a href="https://arxiv.org/abs/2212.00774"><img src="https://img.shields.io/badge/arXiv-2212.00774-b31b1b.svg" height=22.5></a>
 <a href="https://colab.research.google.com/drive/1zixo66UYGl70VOPy053o7IV_YkQt5lCZ?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=22.5></a>
 <a href="https://pals.ttic.edu/p/score-jacobian-chaining"><img src="https://img.shields.io/website?down_color=lightgrey&down_message=offline&label=Project%20Page&up_color=lightgreen&up_message=online&url=https%3A%2F%2Fpals.ttic.edu%2Fp%2Fscore-jacobian-chaining" height=22.5></a>

 <!-- [ [arxiv](https://arxiv.org/abs/2212.00774) | [project page](https://pals.ttic.edu/p/score-jacobian-chaining) | [colab](https://colab.research.google.com/drive/1zixo66UYGl70VOPy053o7IV_YkQt5lCZ?usp=sharing ) ] -->

 Many thanks to [dvschultz](https://github.com/dvschultz) for the colab.

+## Updates
+- We have added a subpixel rendering script for final high-quality vis. The jittery videos you might have seen should be significantly better now. Run `python /path/to/sjc/highres_final_vis.py` in the exp folder after training is complete. There are a few toggles in the script you can play with, but the defaults are fine. It takes about 5 minutes / 11GB on an A5000; the extra time is mainly due to the SD decoder.
+- If you are running SJC with a DreamBooth fine-tuned model: the model's output distribution is already significantly narrowed, so it might help to use a lower guidance scale, e.g. `--sd.scale 50.0`. Intense mode-seeking is one cause of the multi-face problem. We have internally tried DreamBooth with view-dependent prompt fine-tuning, but by and large DreamBooth integration is not ready.
+
+## TODOs
+- [ ] make seeds configurable. So far all seeds are hardcoded to 0.
+- [ ] add a script to reproduce the 2D experiments in Fig. 4. The figure might need to change once it is tied to seeds. Note that for a simple, aligned domain like faces, a simple schedule such as a single σ=1.5 can already generate some nice images, but not for bedrooms; that domain is too diverse and annealing still seems needed.
+- [ ] the main paper figures did not use subpixel rendering; the appendix figures did. Replace the main paper figures to make them consistent.
+
 ## License
 Since we use Stable Diffusion, we are releasing under their OpenRAIL license. Otherwise we do not
 identify any components or upstream code that carry restrictive licensing requirements.

 ## Structure
 In addition to SJC, the repo also contains an implementation of the [Karras sampler](https://arxiv.org/abs/2206.00364),
 and a customized, simple voxel NeRF. We provide the abstract parent class based on Karras et al. and include
 a few types of diffusion model here. See adapt.py.

 ## Installation

@@ -46,8 +55,8 @@ git clone --depth 1 git@github.com:CompVis/taming-transformers.git && pip instal

 ## Downloading checkpoints
 We have bundled a minimal set of things you need to download (SD v1.5 ckpt, gddpm ckpt for LSUN and FFHQ)
 in a tar file, made available at our download server [here](https://dl.ttic.edu/pals/sjc/release.tar).
 It is a single file of 12GB, and you can use wget or curl.

 Remember to __update__ `env.json` to point at the new checkpoint root where you have uncompressed the files.

@@ -57,7 +66,7 @@ Make a new directory to run experiments (the script generates many logging files
 mkdir exp
 cd exp
 ```
-Run the following command to generate a new 3D asset. It takes about 25 minutes on a single A5000 GPU for 10000 steps of optimization.
+Run the following command to generate a new 3D asset. It takes about 25 minutes / 10GB GPU mem on a single A5000 GPU for 10000 steps of optimization.
 ```bash
 python /path/to/sjc/run_sjc.py \
     --sd.prompt "A zoomed out high quality photo of Temple of Heaven" \

@@ -86,15 +95,11 @@ python /path/to/sjc/run_sjc.py \

 `depth_weight` the weighting factor of the center depth loss

 `var_red` whether to use Eq. 16 vs Eq. 15. For some prompts such as Obama we actually see better results with Eq. 15.

 Visualization results are stored in the current directory. In directories named `test_*` there are images (under `view`) and videos (under `view_seq`) rendered at different iterations.

-## TODOs
-- [ ] add sub-pixel rendering script for high quality visualization such as in the teaser.
-- [ ] add script to reproduce 2D experiments in Fig 4. The Fig might need change once it's tied to seeds. Note that for a simple aligned domain like faces, simple scheduling like using a single σ=1.5 could already generate some nice images. But not so for bedrooms; it's too diverse and annealing seems still needed.
-
 ## To Reproduce the Results in the Paper
 First create a clean directory for your experiment, then run one of the following scripts from that folder:
 ### Trump

@@ -200,19 +205,19 @@ python /path/to/sjc/run_sjc.py --sd.prompt "A pig" --n_steps 10000 --lr 0.05 --s
 ```
 python /path/to/sjc/run_nerf.py
 ```
 Our bundle contains a tar ball for the lego bulldozer dataset. Untar it and it will work.

 ## To Sample 2D images with the Karras Sampler
 ```
 python /path/to/sjc/run_img_sampling.py
 ```
 Use help -h to see the options available. Will expand the details later.

 ## Bib
 ```
 @article{sjc,
   title={Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation},
   author={Wang, Haochen and Du, Xiaodan and Li, Jiahao and Yeh, Raymond A. and Shakhnarovich, Greg},
   journal={arXiv preprint arXiv:2212.00774},
   year={2022},
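The abstract added above describes the core mechanism: back-propagating a pretrained 2D diffusion score through the Jacobian of a differentiable renderer. As a point of reference, here is a minimal conceptual sketch of that chain-rule step in PyTorch. It is not the repo's `run_sjc.py`: `render`, `score_2d`, and `theta` are hypothetical stand-ins for the voxel radiance field, the renderer, and the denoiser-derived 2D score, and details such as noise perturbation of the rendered view and the paper's distribution-mismatch correction are omitted.

```python
import torch

# Conceptual sketch only (assumed names, not the repo's implementation).
# theta: parameters of the voxel radiance field
# render(theta, pose): differentiable renderer producing an image x = g(theta)
# score_2d(x, sigma): 2D score from a pretrained diffusion model at noise level sigma
def sjc_step(theta, render, score_2d, pose, sigma, lr=0.05):
    theta.requires_grad_(True)
    img = render(theta, pose)              # x = g(theta), keeps the autograd graph
    with torch.no_grad():
        score = score_2d(img, sigma)       # 2D score at this camera viewpoint
    # Vector-Jacobian product: accumulates J_g(theta)^T @ score into theta.grad.
    # Aggregating this over many sampled viewpoints gives the 3D score that SJC chains.
    img.backward(gradient=score)
    with torch.no_grad():
        theta += lr * theta.grad           # ascend along the chained score
        theta.grad = None
    return theta
```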
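The Structure section above notes that the repo also ships a Karras sampler alongside an abstract diffusion-model parent class (see adapt.py). For readers unfamiliar with that sampler, below is a minimal, self-contained sketch of the deterministic second-order sampler from Karras et al. (2022). It assumes a generic `denoiser(x, sigma)` interface returning the denoised estimate D(x; σ) and is illustrative only, not the repo's actual implementation.

```python
import torch

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Noise schedule of Karras et al. 2022 (Eq. 5), with a trailing sigma = 0.
    ramp = torch.linspace(0, 1, n)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, sigmas.new_zeros(1)])

@torch.no_grad()
def heun_sample(denoiser, shape, n_steps=50, device="cpu"):
    # Deterministic ODE sampler (Algorithm 1, Heun's method) of Karras et al. 2022.
    # `denoiser(x, sigma)` is an assumed interface returning D(x; sigma).
    sigmas = karras_sigmas(n_steps).to(device)
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(n_steps):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoiser(x, s)) / s           # dx/dsigma at sigma = s
        x_next = x + (s_next - s) * d          # Euler step
        if s_next > 0:                         # second-order (Heun) correction
            d_next = (x_next - denoiser(x_next, s_next)) / s_next
            x_next = x + (s_next - s) * 0.5 * (d + d_next)
        x = x_next
    return x
```

With a Stable-Diffusion-style model, `denoiser` would have to wrap the network's epsilon prediction into a denoised estimate; presumably that kind of glue is what the abstract adapter class in adapt.py standardizes across the bundled diffusion models.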
README.md  CHANGED

@@ -10,4 +10,12 @@ pinned: false
 license: creativeml-openrail-m
 ---

-
+## Bib
+```
+@article{sjc,
+  title={Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation},
+  author={Wang, Haochen and Du, Xiaodan and Li, Jiahao and Yeh, Raymond A. and Shakhnarovich, Greg},
+  journal={arXiv preprint arXiv:2212.00774},
+  year={2022},
+}
+```
app.py  CHANGED

@@ -16,6 +16,7 @@ from voxnerf.utils import every
 from voxnerf.vis import stitch_vis, bad_vis as nerf_vis

 from run_sjc import render_one_view, tsr_stats
+from highres_final_vis import highres_render_one_view

 import gradio as gr
 import gc

@@ -167,22 +168,25 @@ with gr.Blocks(css=css) as demo:

 # TODO: Save Checkpoint
 with torch.no_grad():
+    n_frames=200
+    factor=4
     ckpt = vox.state_dict()
     H, W = poser.H, poser.W
     vox.eval()
-    K, poses = poser.sample_test(
+    K, poses = poser.sample_test(n_frames)
+    del n_frames
+    poses = poses[60:]  # skip the full overhead view; not interesting

     aabb = vox.aabb.T.cpu().numpy()
     vox = vox.to(device_glb)

     num_imgs = len(poses)
-
     all_images = []

     for i in (pbar := tqdm(range(num_imgs))):

         pose = poses[i]
-        y, depth =
+        y, depth = highres_render_one_view(vox, aabb, H, W, K, pose, f=factor)
         if isinstance(model, StableDiffusion):
             y = model.decode(y)
         pane, img, depth = vis_routine(y, depth)
highres_final_vis.py  ADDED

@@ -0,0 +1,124 @@
import numpy as np
import torch
from einops import rearrange

from voxnerf.render import subpixel_rays_from_img

from run_sjc import (
    SJC, ScoreAdapter, StableDiffusion,
    tqdm, EventStorage, HeartBeat, EarlyLoopBreak, get_event_storage, get_heartbeat, optional_load_config, read_stats,
    vis_routine, stitch_vis, latest_ckpt,
    scene_box_filter, render_ray_bundle, as_torch_tsrs,
    device_glb
)


# the SD decoder is very memory hungry; the latent image cannot be too large.
# for a graphics card with < 12 GB memory, set this to 128; quality is already good.
# if your card has 12 to 24 GB memory, you can set this to 200;
# but visually it won't help beyond a certain point. Our teaser is done with 128.
decoder_bottleneck_hw = 128


def final_vis():
    cfg = optional_load_config(fname="full_config.yml")
    assert len(cfg) > 0, "can't find cfg file"
    mod = SJC(**cfg)

    family = cfg.pop("family")
    model: ScoreAdapter = getattr(mod, family).make()
    vox = mod.vox.make()
    poser = mod.pose.make()

    pbar = tqdm(range(1))

    with EventStorage(), HeartBeat(pbar):
        ckpt_fname = latest_ckpt()
        state = torch.load(ckpt_fname, map_location="cpu")
        vox.load_state_dict(state)
        vox.to(device_glb)

        with EventStorage("highres"):
            # what dominates the speed is NOT the factor here.
            # you can try from 2 to 8, and the speed is about the same.
            # the dominating factor in the pipeline I believe is the SD decoder.
            evaluate(model, vox, poser, n_frames=200, factor=4)


@torch.no_grad()
def evaluate(score_model, vox, poser, n_frames=200, factor=4):
    H, W = poser.H, poser.W
    vox.eval()
    K, poses = poser.sample_test(n_frames)
    del n_frames
    poses = poses[60:]  # skip the full overhead view; not interesting

    fuse = EarlyLoopBreak(5)
    metric = get_event_storage()
    hbeat = get_heartbeat()

    aabb = vox.aabb.T.cpu().numpy()
    vox = vox.to(device_glb)

    num_imgs = len(poses)

    for i in (pbar := tqdm(range(num_imgs))):
        if fuse.on_break():
            break

        pose = poses[i]
        y, depth = highres_render_one_view(vox, aabb, H, W, K, pose, f=factor)
        if isinstance(score_model, StableDiffusion):
            y = score_model.decode(y)
        vis_routine(metric, y, depth)

        metric.step()
        hbeat.beat()

    metric.flush_history()

    metric.put_artifact(
        "movie_im_and_depth", ".mp4",
        lambda fn: stitch_vis(fn, read_stats(metric.output_dir, "view")[1])
    )

    metric.put_artifact(
        "movie_im_only", ".mp4",
        lambda fn: stitch_vis(fn, read_stats(metric.output_dir, "img")[1])
    )

    metric.step()


def highres_render_one_view(vox, aabb, H, W, K, pose, f=4):
    bs = 4096

    ro, rd = subpixel_rays_from_img(H, W, K, pose, f=f)
    ro, rd, t_min, t_max = scene_box_filter(ro, rd, aabb)
    n = len(ro)
    ro, rd, t_min, t_max = as_torch_tsrs(vox.device, ro, rd, t_min, t_max)

    rgbs = torch.zeros(n, 4, device=vox.device)
    depth = torch.zeros(n, 1, device=vox.device)

    with torch.no_grad():
        for i in range(int(np.ceil(n / bs))):
            s = i * bs
            e = min(n, s + bs)
            _rgbs, _depth, _ = render_ray_bundle(
                vox, ro[s:e], rd[s:e], t_min[s:e], t_max[s:e]
            )
            rgbs[s:e] = _rgbs
            depth[s:e] = _depth

    rgbs = rearrange(rgbs, "(h w) c -> 1 c h w", h=H*f, w=W*f)
    depth = rearrange(depth, "(h w) 1 -> h w", h=H*f, w=W*f)
    rgbs = torch.nn.functional.interpolate(
        rgbs, (decoder_bottleneck_hw, decoder_bottleneck_hw),
        mode='bilinear', antialias=True
    )
    return rgbs, depth


if __name__ == "__main__":
    final_vis()
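In `highres_render_one_view` above, rays are generated at `f` times the base resolution, rendered in chunks, reassembled into an `(H*f, W*f)` latent image, and then downsampled to `decoder_bottleneck_hw` with antialiased bilinear interpolation before the SD decoder. The toy snippet below (plain PyTorch, independent of the repo's renderer) illustrates why that supersample-then-filter step suppresses the aliasing that shows up as jitter when each output pixel is backed by a single ray; the sinusoid is just a stand-in for a supersampled render.

```python
import torch
import torch.nn.functional as F

H = W = 64   # target resolution; decoder_bottleneck_hw plays this role above
f = 4        # supersampling factor, as in highres_render_one_view

# A synthetic high-frequency pattern stands in for the f-times supersampled render.
ys, xs = torch.meshgrid(
    torch.linspace(0, 1, H * f), torch.linspace(0, 1, W * f), indexing="ij"
)
fine = (torch.sin(400 * xs) * torch.cos(400 * ys))[None, None]   # (1, 1, H*f, W*f)

# One ray per output pixel: plain subsampling; frequencies above the coarse
# Nyquist limit fold back as aliasing and wobble from frame to frame.
aliased = fine[..., ::f, ::f]

# f*f samples per output pixel: antialiased downsample, as done before the SD decoder.
smooth = F.interpolate(fine, (H, W), mode="bilinear", antialias=True)

print(aliased.shape, smooth.shape)                  # both (1, 1, 64, 64)
print(aliased.std().item(), smooth.std().item())    # filtered image carries much less aliasing energy
```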
voxnerf/vox.py  CHANGED

@@ -169,9 +169,6 @@ class VoxRF(nn.Module):

 @VOXRF_REGISTRY.register()
 class V_SJC(VoxRF):
-    """
-    For SJC, when sampling density σ, add a gaussian ball offset
-    """
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         # rendering color in [-1, 1] range, since score models all operate on centered img