anzorq/finetuned_diffusion · Getting this error when copying it to my space

umair007

Dec 27, 2022

anzorq

Owner Dec 27, 2022

Are you running it on GPU?

umair007

Dec 27, 2022

CPU

Omnibus

Dec 29, 2022

This works on CPU, but processing takes 15 minutes with Redshift Render.

https://huggingface.co/spaces/Omnibus/finetuned_diffusion_cpu

johnslegers

Jan 27, 2023


edited Jan 27, 2023

@anzorq:

I had that exact same error when running a duplicate of your space on a T4.

Just copying and running it on a CPU produces this totally unhelpful error:

@umair007 & @anzorq:

You should be able to fix the "LayerNormKernelImpl not implemented for 'Half'" error by finding the following lines in your requirements.txt:

torch
torchvision==0.13.1+cu113

Now replace them with this:

torch==1.12.1+cu113
torchvision==0.13.1+cu113

That should fix that particular bug.
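A quick way to catch this kind of mismatch at startup is to compare the installed versions. This is only an illustration, not part of the space: `versions_match`, `minor` and the `TORCHVISION_TO_TORCH` table are hypothetical helpers, and the table lists just a few pairs from PyTorch's published compatibility matrix (torchvision 0.13 goes with torch 1.12, and so on).

```python
# Partial, hand-copied map of torchvision minor releases to the torch minor
# release they were built against. Extend it for the versions you pin.
TORCHVISION_TO_TORCH = {
    "0.13": "1.12",
    "0.14": "1.13",
    "0.15": "2.0",
}


def minor(version: str) -> str:
    """Return 'major.minor' from a version string like '1.12.1+cu113'."""
    public = version.split("+", 1)[0]  # drop the local build tag, e.g. '+cu113'
    return ".".join(public.split(".")[:2])


def versions_match(torch_version: str, torchvision_version: str) -> bool:
    """True if the torch/torchvision pair is a known-compatible combination."""
    expected = TORCHVISION_TO_TORCH.get(minor(torchvision_version))
    if expected is None:
        raise KeyError(f"unknown torchvision version: {torchvision_version}")
    return minor(torch_version) == expected
```

Calling `versions_match(torch.__version__, torchvision.__version__)` near the top of app.py would, for example, flag a torch 1.13.x install sitting next to torchvision==0.13.1+cu113.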

After I fixed that, however, my app broke on the following line:

pipe.enable_xformers_memory_efficient_attention()

I tried fixing this by taking the following lines:

if torch.cuda.is_available():
  pipe = pipe.to("cuda")
  pipe.enable_xformers_memory_efficient_attention()

And then I replaced them with this:

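def to_cuda(torch, pipe):
    # Move the pipeline to the GPU (if there is one) and try to enable xformers.
    # Any failure, such as an xformers wheel built for a different GPU, is
    # caught so the app can still start.
    try:
        if torch.cuda.is_available():
            pipe = pipe.to("cuda")
            pipe.enable_xformers_memory_efficient_attention()
        return True
    except Exception:
        return False

# Call the helper only after it has been defined.
to_cuda(torch, pipe)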

After that, the app started successfully. However, it was still quite unstable and kept throwing errors.
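A slightly more defensive variant of the same idea is sketched below. It is an assumption-laden rewrite, not the space's code: `setup_pipeline` is a hypothetical name, it only wraps the xformers call (so a real CUDA failure in `pipe.to("cuda")` is not silently swallowed), and it returns the pipeline and the device so the rest of the app does not have to assume "cuda". Like `to_cuda`, it takes the `torch` module as an argument.

```python
def setup_pipeline(torch, pipe):
    """Move `pipe` to the GPU when one is available and try to enable xformers.

    Returns (pipe, device, xformers_enabled). If xformers fails (for example
    "no kernel image is available for execution on the device"), the error is
    printed and the pipeline keeps the default attention implementation.
    """
    if not torch.cuda.is_available():
        return pipe, "cpu", False

    pipe = pipe.to("cuda")
    try:
        pipe.enable_xformers_memory_efficient_attention()
    except Exception as exc:  # unlike a bare `except:`, lets Ctrl-C through
        print(f"xformers disabled, falling back to default attention: {exc}")
        return pipe, "cuda", False
    return pipe, "cuda", True
```

Usage would be `pipe, device, use_xformers = setup_pipeline(torch, pipe)`; the returned `device` can then be reused wherever the app would otherwise hard-code "cuda".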

@Omnibus:

At least your version works.

Unfortunately, it's also very slow when running on a T4.

Ideally it should be optimized so that it still works (slowly) on a CPU but runs fast on a GPU.

Omnibus

Jan 27, 2023

@johnslegers

I found that switching all of the "torch_dtype=torch.float16" to "torch_dtype=torch.get_default_dtype()" allows the program to run on CPU and removes the "LayerNormKernelImpl not implemented for 'Half'" error when running on CPU. The requirements also need to be changed to download CPU-compatible modules, as you mentioned.

I suspect that, for the program to run on both CPU and GPU, a toggle like this might work:

"if device == GPU: torch_dtype=torch.float16 elif device == CPU: torch_dtype=torch.get_default_dtype()"

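In Python, that toggle could look roughly like the sketch below. It follows the same convention as the `to_cuda(torch, pipe)` helper earlier in the thread (the `torch` module is passed in), and `pick_device_and_dtype` is a hypothetical name.

```python
def pick_device_and_dtype(torch):
    """Choose where to run and which dtype to load the weights in.

    fp16 ("half") is only used on CUDA. On CPU the default dtype (float32
    unless it has been changed) is used instead, which sidesteps the
    LayerNormKernelImpl / 'Half' error seen with the torch versions here.
    """
    if torch.cuda.is_available():
        return "cuda", torch.float16
    return "cpu", torch.get_default_dtype()
```

Call it once near the top of app.py, e.g. `device, dtype = pick_device_and_dtype(torch)`, then pass `torch_dtype=dtype` to each `from_pretrained(...)` call instead of `torch_dtype=torch.float16`.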
anzorq

Owner Jan 27, 2023

> @johnslegers
>
> I found that switching all of the "torch_dtype=torch.float16" to "torch_dtype=torch.get_default_dtype()" allows the program to run on CPU and removes the "LayerNormKernelImpl not implemented for 'Half'" error when running on CPU. The requirements also need to be changed to download CPU-compatible modules, as you mentioned.
>
> I suspect that, for the program to run on both CPU and GPU, a toggle like this might work:
>
> "if device == GPU: torch_dtype=torch.float16 elif device == CPU: torch_dtype=torch.get_default_dtype()"

This demo was meant to be run on GPU only. If you want to run it on CPU, follow the instructions above.

johnslegers

Jan 27, 2023


edited Jan 27, 2023

@anzorq:

Please read my previous comment.

As I already explained, I tried running it in a GPU-enabled environment (T4 small --- 4 vCPU / 15 GiB RAM / Nvidia T4) and I got the same error as the OP.

As I also already explained, I got rid of the "LayerNormKernelImpl not implemented for 'Half'" error by pinning a torch version that matches the pinned torchvision version.

However, after that it produces a different error:

Traceback (most recent call last):
  File "app.py", line 52, in <module>
    pipe.enable_xformers_memory_efficient_attention()
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 870, in enable_xformers_memory_efficient_attention
    self.set_use_memory_efficient_attention_xformers(True, attention_op)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 895, in set_use_memory_efficient_attention_xformers
    fn_recursive_set_mem_eff(module)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 886, in fn_recursive_set_mem_eff
    module.set_use_memory_efficient_attention_xformers(valid, attention_op)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/modeling_utils.py", line 208, in set_use_memory_efficient_attention_xformers
    fn_recursive_set_mem_eff(module)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/modeling_utils.py", line 204, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/modeling_utils.py", line 204, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/modeling_utils.py", line 204, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/modeling_utils.py", line 201, in fn_recursive_set_mem_eff
    module.set_use_memory_efficient_attention_xformers(valid, attention_op)
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/attention.py", line 117, in set_use_memory_efficient_attention_xformers
    raise e
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/models/attention.py", line 111, in set_use_memory_efficient_attention_xformers
    _ = xformers.ops.memory_efficient_attention(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/xformers/ops/memory_efficient_attention.py", line 967, in memory_efficient_attention
    return op.forward_no_grad(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/xformers/ops/memory_efficient_attention.py", line 343, in forward_no_grad
    return cls.FORWARD_OPERATOR(
  File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

Wrapping a try ... except around pipe.enable_xformers_memory_efficient_attention() fixes that error as well, as I also already explained.

That's as far as I got fixing errors myself. Will continue testing / developing soon.

anzorq

Owner Jan 27, 2023

This space installs a prebuilt xformers wheel (.whl) built specifically for the A10G. To use it on a T4, you need to either disable xformers or install an xformers build for the T4.
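One way to "disable xformers" automatically on other hardware is to check the GPU's compute capability before enabling it. This sketch assumes the prebuilt wheel only carries kernels for the A10G (compute capability 8.6); `xformers_supported` and the `DISABLE_XFORMERS` environment variable are hypothetical names, not something the space defines. `torch.cuda.get_device_capability()` is the PyTorch call that returns the (major, minor) tuple.

```python
import os

# Assumption: the prebuilt wheel only has kernels for compute capability 8.6
# (A10G). Add entries if you install a wheel built for other GPUs.
XFORMERS_CAPABILITIES = {(8, 6)}


def xformers_supported(torch, capabilities=XFORMERS_CAPABILITIES):
    """Return True if xformers should be enabled on the current device."""
    # Hypothetical manual opt-out, e.g. DISABLE_XFORMERS=1 in the Space settings.
    if os.environ.get("DISABLE_XFORMERS") == "1":
        return False
    if not torch.cuda.is_available():
        return False
    # An A10G reports (8, 6); a T4 reports (7, 5).
    return tuple(torch.cuda.get_device_capability()) in capabilities
```

With this, `if xformers_supported(torch): pipe.enable_xformers_memory_efficient_attention()` would skip xformers on a T4 instead of failing with "no kernel image is available for execution on the device".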

johnslegers

Jan 28, 2023

> This space installs a prebuilt xformers wheel (.whl) built specifically for the A10G. To use it on a T4, you need to either disable xformers or install an xformers build for the T4.

I kind of suspected that's why it broke on that line. I ran into a similar (possibly the same) issue with another app on Google Colab a while ago.

Either way, my fix keeps the app from breaking the moment its hardware is downgraded from an A10G to a T4.

This is especially important for people who, like myself, run their environment on a community GPU grant, as Hugging Face can change the hardware at any time without the author being aware of it.

If you don't care about this, that's fine, I guess. It's your app. But in that case you might want to make people aware that your app is designed to work as-is only on Hugging Face A10G environments, and that it will break on both T4 and CPU-only environments unless certain adjustments are made. This would save lots of people headaches when trying to duplicate your space.

AdamOswald1

Feb 6, 2023

> @johnslegers
>
> I found that switching all of the "torch_dtype=torch.float16" to "torch_dtype=torch.get_default_dtype()" allows the program to run on CPU and removes the "LayerNormKernelImpl not implemented for 'Half'" error when running on CPU. The requirements also need to be changed to download CPU-compatible modules, as you mentioned.
>
> I suspect that, for the program to run on both CPU and GPU, a toggle like this might work:
>
> "if device == GPU: torch_dtype=torch.float16 elif device == CPU: torch_dtype=torch.get_default_dtype()"

How should I do that last bit? The same way I would switch all of the "torch_dtype=torch.float16" to "torch_dtype=torch.get_default_dtype()"?
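One way to do that last bit: choose the dtype once near the top of app.py and replace each hard-coded torch_dtype=torch.float16 with that variable. If you would rather not edit every call by hand, the sketch below does the search-and-replace on the source text. `patch_app_source` and `TORCH_DTYPE` are hypothetical names, and it assumes app.py contains a plain `import torch` line.

```python
# Hypothetical shared variable, defined once right after `import torch`.
DTYPE_DEFINITION = (
    "TORCH_DTYPE = torch.float16 if torch.cuda.is_available() "
    "else torch.get_default_dtype()\n"
)


def patch_app_source(source: str) -> str:
    """Point every hard-coded fp16 dtype at one variable chosen at startup."""
    if "TORCH_DTYPE =" in source:
        return source  # already patched; running twice changes nothing
    patched = source.replace("torch_dtype=torch.float16", "torch_dtype=TORCH_DTYPE")
    lines = patched.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == "import torch":
            if not line.endswith("\n"):
                lines[i] = line + "\n"  # keep the new line on its own line
            lines.insert(i + 1, DTYPE_DEFINITION)
            return "".join(lines)
    raise ValueError("no plain 'import torch' line found")
```

You could run it once with `Path("app.py").write_text(patch_app_source(Path("app.py").read_text()))` (using `pathlib.Path`) and review the diff before committing.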