Out of curiosity, how many resources are needed?

#2
by pmova - opened

Hello,
First, thank you for providing the SVDQ models. Second, I would like to ask what hardware you use (especially how much memory it has) and how long it takes to compute the FP4 and INT4 results on it. When I looked into running the original scripts, I got the impression that 16GB cards (even a pair of them) cannot do this in a reasonable time.

Lastly, does the calculation store the intermediate projection matrices for the 32 singular values (rank)? I was wondering whether it would be possible to reuse the same projection matrices for other Flux models and skip this step, or use them as a starting point and skip some iterations, but I was never able to find any published for the original reductions.

Regards,
P.M.

I use an RTX Pro 6000 card (96GB VRAM). Memory use varies during processing, but it can spike up to 76GB with my current settings. It takes around 24 hours to process each quant. You can speed it up by using smaller batch sizes or skipping smoothing, but quality takes a hit. I tried to make it work on an A40 (48GB), but I couldn't get the settings where I wanted them while maintaining quality. The critical setting is the number of calibration samples: if you lower it below 32, quality seems to take a hit in the evaluations.

I would agree that two 16GB cards are not enough.

Yes, from what I understand, it stores the projection matrices in branch.pt. I suppose it's possible in theory to reuse them as long as the models have exactly the same architecture and layer dimensions. This is an interesting idea, because calibrating the low-rank branches takes around 12-13 hours per quant. Deepcompressor has the flag --load-from, which might pick up a branch.pt file in a folder. Very much worth considering, because it would save a lot of time.

Just to follow up on this: you can reuse branch.pt between runs using --load-from, but with some restrictions. The model architecture needs to be exactly the same: if a merge is based on Flux Dev, you can't reuse its branch.pt for Flux Krea. Similarly, you can't reuse it between INT4 and FP4 quants (makes sense, but I tested anyway). So this is a time saver under the right circumstances.
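
For anyone who wants to sanity-check a cached branch before pointing --load-from at it, the restrictions above can be written down as a small pre-flight check. This is only a sketch, not how Deepcompressor validates anything: the function name, the base-model and format labels, and the layer names in the example are all hypothetical, and it assumes you have already extracted a name-to-shape mapping from each side (I have not verified the internal layout of branch.pt).

```python
from typing import Dict, List, Tuple

# Hypothetical representation: layer name -> tensor shape.
Shapes = Dict[str, Tuple[int, ...]]


def branch_reuse_problems(
    cached_shapes: Shapes,
    target_shapes: Shapes,
    cached_base: str,
    target_base: str,
    cached_format: str,
    target_format: str,
) -> List[str]:
    """Return reasons not to reuse a cached branch (empty list means no problem found)."""
    problems: List[str] = []

    # Restriction from the thread: the base model has to match (Dev vs Krea fails).
    if cached_base != target_base:
        problems.append(f"base model differs: {cached_base} vs {target_base}")

    # Restriction from the thread: INT4 and FP4 branches are not interchangeable.
    if cached_format != target_format:
        problems.append(f"quant format differs: {cached_format} vs {target_format}")

    # Same architecture: both sides must contain the same set of layers.
    missing = sorted(set(target_shapes) - set(cached_shapes))
    extra = sorted(set(cached_shapes) - set(target_shapes))
    if missing or extra:
        problems.append(
            f"layer names differ: {len(missing)} missing, {len(extra)} unexpected"
        )

    # Same layer dimensions: shared layers must have identical shapes.
    for name in sorted(set(cached_shapes) & set(target_shapes)):
        if cached_shapes[name] != target_shapes[name]:
            problems.append(
                f"shape mismatch at {name}: "
                f"{cached_shapes[name]} vs {target_shapes[name]}"
            )

    return problems


# Example with made-up layer names and a rank of 32.
cached = {"blocks.0.attn.qkv": (3072, 32), "blocks.0.mlp.fc1": (3072, 32)}
target = {"blocks.0.attn.qkv": (3072, 32), "blocks.0.mlp.fc1": (3072, 32)}

same_run = branch_reuse_problems(cached, target, "flux-dev", "flux-dev", "int4", "int4")
cross_format = branch_reuse_problems(cached, target, "flux-dev", "flux-dev", "int4", "fp4")
print(same_run)
print(cross_format)
```

An empty list only means none of the known restrictions were tripped; it does not guarantee that the reused branch gives the same quality as a fresh calibration.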
