HuggingFace Inference Endpoints Issue (Detailed Information)

#4
by blevlabs - opened

Hello, I saw that this topic was related to one of the closed tickets, where the user did not provide adequate information. I wanted to share my own attempt at hosting this model on an Inference Endpoint, since I have been trying to use it.

Here is the log output:

2023/12/15 10:03:22 ~ {"timestamp":"2023-12-15T15:03:22.451927Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2023/12/15 10:03:22 ~ {"timestamp":"2023-12-15T15:03:22.452217Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
2023/12/15 10:03:22 ~ {"timestamp":"2023-12-15T15:03:22.452186Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2023/12/15 10:03:22 ~ {"timestamp":"2023-12-15T15:03:22.452239Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
2023/12/15 10:03:22 ~ {"timestamp":"2023-12-15T15:03:22.452655Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.517688Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.524783Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.526299Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.529150Z","level":"WARN","fields":{"message":"Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.537123Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.543898Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.545572Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.548302Z","level":"WARN","fields":{"message":"Could not import Mistral model: Mistral model requires flash attn v2\n"},"target":"text_generation_launcher"}
2023/12/15 10:03:25 ~ {"timestamp":"2023-12-15T15:03:25.591579Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1688, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 783, in invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n    server.serve(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n    asyncio.run(\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n    model = get_model(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 291, in get_model\n    raise ValueError(\"sharded is not supported for AutoModel\")\nValueError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher"}
[The same "ValueError: sharded is not supported for AutoModel" traceback is logged three more times, once per remaining shard; elided for brevity.]
2023/12/15 10:03:26 ~ {"timestamp":"2023-12-15T15:03:26.157644Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n    server.serve(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n    asyncio.run(\n\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n    model = get_model(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 291, in get_model\n    raise ValueError(\"sharded is not supported for AutoModel\")\n\nValueError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
[The same shard-manager traceback is logged for ranks 2 and 3; elided for brevity.]
2023/12/15 10:03:26 ~ {"timestamp":"2023-12-15T15:03:26.256251Z","level":"ERROR","fields":{"message":"Shard 3 failed to start"},"target":"text_generation_launcher"}
2023/12/15 10:03:26 ~ {"timestamp":"2023-12-15T15:03:26.256295Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
[Rank 0 logs the same shard-manager traceback during shutdown; elided for brevity.]


2023/12/15 10:03:26 ~ Error: ShardCannotStart
2023/12/15 10:04:10 ~ {"timestamp":"2023-12-15T15:04:10.264390Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
2023/12/15 10:04:10 ~ {"timestamp":"2023-12-15T15:04:10.264345Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 1512, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 2048, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: \"levlabs404b-aws-dolphin-2-5-mixtral-8x7b-696-69796f6f98-j6xmn\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
2023/12/15 10:04:10 ~ {"timestamp":"2023-12-15T15:04:10.264479Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}

I have the full log file here, as HuggingFace does not support uploading files to issues/conversations: https://file.io/TDOE1DBMwnVJ
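Reading the warnings and the final ValueError together, the failure chain appears to be: the GPU's compute capability (7.5) rules out Flash Attention V2; without it the Mistral-specific model class cannot be imported; TGI then falls back to AutoModel, and AutoModel refuses the four-way sharding the endpoint requested. A simplified sketch of that dispatch logic (my reconstruction, not TGI's actual code):

```python
# Simplified reconstruction of the dispatch the logs trace through
# (illustrative only, not text-generation-inference's real code).
import torch

def get_model(model_type: str, sharded: bool):
    # Flash Attention V2 needs compute capability >= 8.0 (Ampere or newer).
    fa2_available = torch.cuda.get_device_capability(0) >= (8, 0)
    if model_type == "mistral" and not fa2_available:
        # "Could not import Mistral model: Mistral model requires flash attn v2"
        model_type = "auto"
    if model_type == "auto" and sharded:
        # The error every shard hits in the logs above.
        raise ValueError("sharded is not supported for AutoModel")
    ...  # load and return the model
```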

Please let me know if there is any more information I can provide to assist in this process. Thank you for your time and for the effort you have put into building these brilliant models!

I wonder if this is related to my issue with vLLM. The files might have to be in the safetensors format.
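If the safetensors theory holds, re-serializing the checkpoint locally would look roughly like this (a sketch: the repo id and output directory are placeholders, and it needs enough CPU RAM to hold the full model):

```python
# Sketch: re-save a .bin checkpoint as safetensors using transformers'
# built-in safe_serialization flag. Repo id and output dir are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ehartford/dolphin-2.5-mixtral-8x7b"  # placeholder repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", low_cpu_mem_usage=True
)
model.save_pretrained("dolphin-2.5-safetensors", safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("dolphin-2.5-safetensors")
```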

Perhaps. I am using the default configuration available. Plus, it looks like regular Mistral is functioning fine on Inference Endpoints; it's just this fine-tuned variant that isn't.

I am also having exactly the same issue

Same issue here

Ditto... Tried to upgrade the endpoint to A100, and got this message: Quota exceeded for p4de. Currently available: 0, requested: 1. Please contact us at api-enterprise@huggingface.co to increase your quota.

Looks like A100 endpoints are ALL in use.

This was when I was trying to run it on an A100, so I do not think it is related to that. I also tried different quantized configurations, as well as no quantization, and hit the same issue.

@ehartford Do you have any thoughts here?

Cognitive Computations org

Inference is not my area of expertise.

So I asked Grok: "What version of CUDA does Flash Attention V2 require?" and got this response: "Flash Attention V2 requires CUDA 11.0 or later. However, a recent update to the library (Dec 13) has added an automatic fallback from v2 to v1 depth map library for older GPUs, allowing for better compatibility." It also provided a link to this X post: https://twitter.com/polarization_yu/status/1734939306807157243

So my question now is pretty basic: which AWS instances meet Flash Attention V2's hardware requirement? Note that the "7 5" in the warning is the GPU's compute capability, not the CUDA toolkit version: the T4 is compute capability 7.5 (Turing), which appears to be useless here, while Flash Attention V2 wants Ampere-class hardware (compute capability 8.0+), so the 4xA100 instance should qualify. I'm thinking this problem can be solved with a newer Docker image for AWS, but it's not clear how to do that.
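A quick way to check what any given instance actually reports (a sketch):

```python
# Sketch: report the GPU's compute capability; Flash Attention V2 needs
# Ampere or newer, i.e. (major, minor) >= (8, 0). A T4 reports (7, 5).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print(f"CUDA runtime: {torch.version.cuda}")
print("Flash Attention V2 capable:", (major, minor) >= (8, 0))
```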

Cognitive Computations org

I doubt you will find help for aws here.

I tried a good half dozen different AWS container images with newer versions of CUDA, using the custom container configuration. Not a single one started properly; each ran into one error or another. My advice is to hold off on creating an inference endpoint until somebody figures out the right recipe to make this model run. Avoid this model for now, unless you can run it locally on your own machine.
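For the local route, something like this is the minimal starting point (a sketch: the repo id is assumed, and it presumes transformers >= 4.36 for Mixtral support, bitsandbytes installed, and very roughly 26 GB of VRAM for 4-bit):

```python
# Sketch: load the model locally in 4-bit instead of fighting the endpoint.
# Repo id assumed; needs transformers >= 4.36 and bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ehartford/dolphin-2.5-mixtral-8x7b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", load_in_4bit=True
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```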

2023-12-19 10:09:13 INFO:Loading ehartford_dolphin-2.5-mixtral-8x7b... [download progress bar elided: 4.22G/4.22G at 89.0MiB/s]
2023-12-19 10:09:13 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "pinokio\api\oobabooga.pinokio.git\text-generation-webui\modules\ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "i\pinokio\api\oobabooga.pinokio.git\text-generation-webui\modules\models.py", line 85, in load_model
    output = load_func_map[loader](shared.model_name)
  File "pinokio\api\oobabooga.pinokio.git\text-generation-webui\modules\models.py", line 142, in huggingface_loader
    config = AutoConfig.from_pretrained(path_to_model, trust_remote_code=params['trust_remote_code'])
  File "\pinokio\api\oobabooga.pinokio.git\text-generation-webui\installer_files\env\Lib\site-packages\transformers\models\auto\configuration_auto.py", line 1064, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "\pinokio\api\oobabooga.pinokio.git\text-generation-webui\installer_files\env\Lib\site-packages\transformers\models\auto\configuration_auto.py", line 761, in __getitem__
    raise KeyError(key)
KeyError: 'mixtral'

Anybody know what the issue is?
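For what it's worth, CONFIG_MAPPING raising KeyError: 'mixtral' usually means the installed transformers predates Mixtral support, which landed in v4.36.0. A quick check (a sketch):

```python
# Sketch: verify the installed transformers is new enough to know "mixtral".
import transformers
from packaging import version

print("installed:", transformers.__version__)
if version.parse(transformers.__version__) < version.parse("4.36.0"):
    print("too old for Mixtral; try: pip install -U transformers")
```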

Tried again to start Dolphin on an A100 inference endpoint, which DOES support Flash Attention V2... but it's still no good: now the shard manager fails.

67bb6bbb9fn7r4s 2023-12-20T02:36:52.737Z {"timestamp":"2023-12-20T02:36:52.736867Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in <module>\n sys.exit(app())\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__\n return self.main(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main\n return _main(\n File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main\n rv = self.invoke(ctx)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke\n return __callback(*args, **kwargs)\n File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete\n self.run_forever()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever\n self._run_once()\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once\n handle._run()\n File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 336, in get_model\n raise ValueError(f"Unsupported model type {model_type}")\nValueError: Unsupported model type mixtral\n"},"target":"text_generation_launcher"}

67bb6bbb9fn7r4s 2023-12-20T02:36:53.230Z {"timestamp":"2023-12-20T02:36:53.230152Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nTraceback (most recent call last):\n\n File "/opt/conda/bin/text-generation-server", line 8, in <module>\n sys.exit(app())\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve\n server.serve(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve\n asyncio.run(\n\n File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run\n return loop.run_until_complete(main)\n\n File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete\n return future.result()\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner\n model = get_model(\n\n File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 336, in get_model\n raise ValueError(f"Unsupported model type {model_type}")\n\nValueError: Unsupported model type mixtral\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
67bb6bbb9fn7r4s 2023-12-20T02:36:53.328Z {"timestamp":"2023-12-20T02:36:53.328062Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
67bb6bbb9fn7r4s 2023-12-20T02:36:53.328Z Error: ShardCannotStart
67bb6bbb9fn7r4s 2023-12-20T02:36:53.328Z {"timestamp":"2023-12-20T02:36:53.328003Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
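The "Unsupported model type mixtral" points the same way: the TGI build baked into the endpoint image predates Mixtral support (added around TGI v1.3). You can confirm the model_type that TGI dispatches on straight from the Hub (a sketch; repo id assumed):

```python
# Sketch: read config.json from the Hub to confirm the model_type that
# TGI's get_model() dispatches on. Repo id is assumed.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="ehartford/dolphin-2.5-mixtral-8x7b",  # assumed repo id
    filename="config.json",
)
with open(config_path) as f:
    print(json.load(f)["model_type"])  # expected: "mixtral"
```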
