Can't launch with TGI
I'm trying to launch the model with TGI but it's erroring out. Here is the command I'm running:
model=TheBloke/Llama-2-70B-chat-GPTQ
num_shard=1
volume=$PWD/data
quantize=gptq
docker run -d --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model --num-shard $num_shard --quantize $quantize
Am I doing something wrong? Here is the error I get:
2023-07-22T02:01:59.461306Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-22T02:01:59.461559Z INFO shard-manager: text_generation_launcher: Starting shard 0 rank=0
2023-07-22T02:02:03.443521Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params) # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 65, in __init__
    model = FlashLlamaForCausalLM(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 452, in __init__
    self.model = FlashLlamaModel(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 390, in __init__
    [
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 391, in <listcomp>
    FlashLlamaLayer(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 326, in __init__
    self.self_attn = FlashLlamaAttention(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 199, in __init__
    self.query_key_value = _load_gqa(config, prefix, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 152, in _load_gqa
    weights.get_sharded(f"{prefix}.q_proj.weight", dim=0),
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 98, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
Are you using the latest TGI code? They recently merged a fix for this error: https://github.com/huggingface/text-generation-inference/pull/648
I'm using: https://github.com/huggingface/text-generation-inference/releases/tag/v0.9.3
This image specifically: ghcr.io/huggingface/text-generation-inference:0.9.3
I guess that doesn't have the fix?
Correct. 0.9.3 was released 4 days ago and the PR with the fix was merged 2 days ago.
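For anyone else who lands here with the same traceback, here is a rough sketch of what the error seems to be telling us. The traceback shows `_load_gqa` asking for a tensor named `...q_proj.weight`. My understanding (an assumption, not something checked against this exact repo) is that GPTQ checkpoints usually store each projection as `qweight`, `qzeros`, `scales` and `g_idx` instead, so that name is simply not there. The helper `describe_projection` below is hypothetical and not part of TGI; it only classifies a list of tensor names, which you could get from the checkpoint yourself (for example with the keys of a `safetensors` file).

```python
# Hypothetical helper, NOT part of TGI: given the tensor names found in a
# checkpoint, report how one projection layer appears to be stored.

# Assumption: suffixes commonly used by GPTQ-quantized checkpoints.
GPTQ_SUFFIXES = ("qweight", "qzeros", "scales", "g_idx")

PREFIX = "model.layers.0.self_attn.q_proj"


def describe_projection(tensor_names, prefix):
    """Return 'unquantized', 'gptq', or 'missing' for the given prefix."""
    names = set(tensor_names)
    if f"{prefix}.weight" in names:
        # A plain tensor exists, so a lookup of "<prefix>.weight" would succeed.
        return "unquantized"
    if f"{prefix}.qweight" in names:
        # Only the packed tensor exists; a lookup of "<prefix>.weight" would fail.
        return "gptq"
    return "missing"


# Illustrative names only: an assumed layout for one GPTQ projection.
gptq_names = [f"{PREFIX}.{suffix}" for suffix in GPTQ_SUFFIXES]

print(describe_projection(gptq_names, PREFIX))
```

If this reports `gptq` for your checkpoint, the files are probably fine and the loader in your TGI version is the part that needs the fix.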
I built it myself from the latest source and it worked. Thanks for the help.
Great to hear!
Has anyone been able to confirm that it's working? I'm trying to deploy a similar model, but I'm now getting a phantom error with it.