README.md · LavaPlanet/Goliath120B-exl2

license: llama2

Another EXL2 quantization of AlpinDale's goliath-120b (https://huggingface.co/alpindale/goliath-120b), this one at 2.64 BPW and made with exllamav2's new experimental quantization method.

Pippa llama2 Chat was used as the calibration dataset.

Can be run on two RTX 3090s with 24 GB of VRAM each.

Assuming Windows overhead, the figures below should be close enough to estimate your own usage.

2.64BPW @ 4096 ctx
  Empty Ctx
    GPU Split: 18/24
    GPU1: 19.8/24 GB
    GPU2: 21.9/24 GB
    ~10 tk/s
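
As a rough sanity check on the figures above, the weight footprint can be estimated from the bits-per-weight value. This is only a sketch: the parameter count of 118 billion is an assumption (not stated in this README), and the helper name `weight_footprint_gib` is made up for illustration. It covers the weights only, not the context cache, Windows overhead, or anything else resident in VRAM.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


ASSUMED_PARAMS = 118e9  # assumption: approximate parameter count, not from this README
BPW = 2.64              # from this README

weights_gib = weight_footprint_gib(ASSUMED_PARAMS, BPW)
reported_total = 19.8 + 21.9  # GPU1 + GPU2 usage reported above, empty context

print(f"Estimated weights: {weights_gib:.1f} GiB")
print(f"Reported usage across both GPUs: {reported_total:.1f}")
```

The gap between the estimate and the reported usage is what is left for the cache and system overhead, so expect it to grow as the context fills.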