Group size: 128 or -1 for the main branch?

#36
by brendanlui - opened

According to the README.md for the main branch (https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ#provided-files):

| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
|--------|------|------------|----------------------|-----------|---------------------|-----------|-------------|
| main | 4 | 128 | False | 35.33 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |

but the file name is gptq_model-4bit--1g.safetensors rather than gptq_model-4bit-128g.safetensors. So which one is correct?

Sorry, the README is wrong - the main branch is group size -1 (no group size). I'll fix that.
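If you'd rather verify than rely on the README, each GPTQ branch carries a quantize_config.json that records the group size and act-order settings, so they can be checked directly from the Hub. A minimal sketch using huggingface_hub (the branch names here are just examples taken from the README):

```python
import json
from huggingface_hub import hf_hub_download

repo_id = "TheBloke/Llama-2-70B-chat-GPTQ"

# Each GPTQ branch ships a quantize_config.json recording the settings
# AutoGPTQ was run with; "main" is the default branch.
for revision in ["main", "gptq-4bit-128g-actorder_True"]:
    path = hf_hub_download(repo_id, "quantize_config.json", revision=revision)
    with open(path) as f:
        cfg = json.load(f)
    print(f"{revision}: bits={cfg.get('bits')} "
          f"group_size={cfg.get('group_size')} desc_act={cfg.get('desc_act')}")
```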

Could you clarify whether only the main branch supports GPTQ-for-LLaMa? The other branches don't seem to work with it. I've used TGI to start up gptq_model-4bit--1g.safetensors and it worked fine on a single GPU, but starting it on 2 GPUs failed because the group size is not >= 2. So I'm looking for a version with a group size >= 2, but my attempts to start the other branches through TGI have failed as well.

That's confusing. I thought it was the exact opposite - that the main branch wouldn't work with TGI because for this model I used an old GPTQ-for-LLaMa version, and that all the others would work because they were made with AutoGPTQ. Actually no, I made all these with AutoGPTQ so I would expect them all to work.

What problems do you have with the ones in the other branches?

Just to note, I'm using TGI v0.9.4.

I get a 'ShardCannotStart' error, yet it works fine when I start from the main branch on a single GPU.
For example, with gptq-4bit-128g-actorder_True and 2 GPUs:

{"timestamp":"2023-08-18T09:05:32.861563Z","level":"INFO","fields":{"message":"Args { model_id: \"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 4096, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 8192, max_batch_total_tokens: Some(8192), max_waiting_tokens: 20, hostname: \"0.0.0.0\", port: 1234, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: Some(\"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\"), disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861603Z","level":"INFO","fields":{"message":"Sharding model on 2 processes"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861714Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:42.632849Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:44.881424Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:44.881694Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:05:44.881742Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.654342Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n    server.serve(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 180, in serve\n    asyncio.run(\n\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 150, in serve_inner\n    create_exllama_buffers()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py\", line 52, in create_exllama_buffers\n    prepare_buffers(DEVICE, temp_state, temp_dq)\n\nTypeError: prepare_buffers(): incompatible function arguments. The following argument types are supported:\n    1. (arg0: torch.device, arg1: torch.Tensor, arg2: torch.Tensor) -> None\n\nInvoked with: None, tensor([[0.]], dtype=torch.float16), tensor([[0.]], dtype=torch.float16)\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.745107Z","level":"ERROR","fields":{"message":"Shard 1 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.745148Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.986644Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart

OK, if it works on one GPU then I don't think it's an issue with my GPTQs. I don't know exactly what's required for sharding. Could you raise it on the TGI GitHub?

Thanks @TheBloke, the problem was resolved after I updated to the latest TGI code.

What about the group size in the main branch of Llama-2-13B-chat-GPTQ? Since there is another branch called gptq-4bit-128g-actorder_True, is act-order the only difference between these two branches?

Yes that's correct. The model with act-order = True has higher quality, but in the past using act-order + group_size has caused performance problems for some GPTQ clients.

That may now be resolved, and I don't know if it ever affected TGI.

So try 128g + True first and only use 128g + False if performance seems slow. In future I may make 128g + True the 'main' model, or even drop 128g + False entirely, if the performance issues are confirmed to be resolved.
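If it helps with testing the two 128g branches side by side, one approach is to pull each branch into its own local directory and point TGI's --model-id at that path, much like the local paths in your logs above. A minimal sketch, assuming huggingface_hub is installed (the 13B repo and the directory layout here are only examples):

```python
from huggingface_hub import snapshot_download

repo_id = "TheBloke/Llama-2-13B-chat-GPTQ"

# Pull each branch into its own directory, then pass that path as
# --model-id to text-generation-launcher (as in the logs above).
for revision in ["main", "gptq-4bit-128g-actorder_True"]:
    local_dir = f"/tmp/datadrive/Llama-2-13B-chat-GPTQ-{revision}"  # example layout
    path = snapshot_download(repo_id, revision=revision, local_dir=local_dir)
    print(f"{revision} -> {path}")
```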

@TheBloke Do you have any recommendations about which hyperparameters to use for the fastest inference speed with GPTQ models? I ran an experiment on TGI with both quantized and non-quantized Llama-2 models, and I'm confused as to why the GPTQ models always have slower inference for the same request body, while GPU memory usage is almost the same across every model on TGI. FYI, I'm using an A100 80GB for testing.

| Model | No. of GPU(s) | Parameters | Quantization Method | Bits | GPTQ Group Size | ExLlama Compatible? | Processing time / request | GPU Memory Used | Sharded |
|-------|---------------|------------|---------------------|------|-----------------|---------------------|---------------------------|-----------------|---------|
| Llama-2-7b-chat-hf | 2 | 7B | - | 16 | - | - | 4.00 s | 147.1 GB | - |
| Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 3.10 s | 78.7 GB | - |
| Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 11.30 s | 78.8 GB | False |
| Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 4.50 s | 79.3 GB | - |
| Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 11.35 s | 79.3 GB | False |
| Llama-2-13b-chat-hf | 1 | 13B | - | 16 | - | - | 5.35 s | 78.4 GB | - |
| Llama-2-13B-chat-GPTQ (main) | 1 | 13B | GPTQ | 4 | 128 | Yes | 8.80 s | 79.1 GB | - |
| Llama-2-13B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 13B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
| Llama-2-13B-chat-GPTQ (gptq-8bit--1g-actorder_True) | 1 | 13B | GPTQ | 8 | -1 | No | 11.35 s | 78.8 GB | - |
| Llama-2-13B-chat-GPTQ (gptq-8bit-128g-actorder_False) | 1 | 13B | GPTQ | 8 | 128 | No | 11.75 s | 78.7 GB | - |
| Llama-2-70b-chat-hf | 2 | 70B | - | 16 | - | - | 11.4 s | 159.5 GB | - |
| Llama-2-70b-chat-hf | 1 | 70B | bitsandbytes | 4 | - | - | 35.5 s | 74.2 GB | - |
| Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | - |
| Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | False |
| Llama-2-70B-chat-GPTQ (gptq-4bit-32g-actorder_True) | 2 | 70B | GPTQ | 4 | 32 | Yes | 33.8 s | 86.12 GB | - |
| Llama-2-70B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 70B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
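For reference, "Processing time / request" above is the wall-clock time of a single call, which can be timed against TGI's /generate endpoint. A minimal sketch (host, port, prompt and generation parameters are placeholders, not the exact ones used for the table):

```python
import time
import requests

# Placeholder host/port (the logs above use port 1234) and a placeholder prompt.
url = "http://localhost:1234/generate"
payload = {
    "inputs": "Explain GPTQ quantization in one paragraph.",
    "parameters": {"max_new_tokens": 256},
}

start = time.time()
resp = requests.post(url, json=payload, timeout=600)
elapsed = time.time() - start

print(f"Processing time / request: {elapsed:.2f} s")
print(resp.json().get("generated_text", resp.text))
```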
