7900xtx torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

#3
by aaaaaaaaaasdf - opened

Hello everyone,
I hope you’re doing well.
I tried to run this model with vLLM, but it seems the 7900 XTX can't handle it.
Before trying this FP8 model, I successfully ran Mistral-Nemo-Instruct-2407.
Here is my environment: Ubuntu 24.04.1, ROCm 6.2.2,
torch 2.6.0.dev20241029+rocm6.2
torchaudio 2.5.0.dev20241030+rocm6.2
torchvision 0.20.0.dev20241029+rocm6.2
vllm 0.6.3.post1+rocm624
Thank you in advance for your time and assistance!
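For reference, here is a quick check of which architecture this PyTorch build reports for the card (just a sketch; gcnArchName is exposed by ROCm builds of PyTorch, but I guard for it in case a particular build lacks the attribute):

```python
import torch

# Print what the ROCm build of PyTorch sees for GPU 0.
# vLLM's FP8 path needs an MI300-class part (gfx94x); the RX 7900 XTX
# reports gfx1100, which is why torch._scaled_mm refuses to run.
props = torch.cuda.get_device_properties(0)
print("device:", props.name)
print("arch:", getattr(props, "gcnArchName", "n/a"))  # attribute may be absent on some builds
print("capability:", torch.cuda.get_device_capability(0))
print("ROCm/HIP version:", torch.version.hip)
```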

RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241031-223916.pkl): torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Here is the full console log:
vllm serve neuralmagic/Mistral-Nemo-Instruct-2407-FP8 --tokenizer_mode auto --config_format auto --load_format auto --max-model-len 16000 --enable-chunked-prefill
WARNING 10-31 22:38:52 rocm.py:13] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
INFO 10-31 22:38:55 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 10-31 22:38:55 api_server.py:529] args: Namespace(subparser='serve', model_tag='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x75fd926e7e20>)
INFO 10-31 22:38:55 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/1b699738-083e-4440-85ea-88426b6470bc for IPC Path.
INFO 10-31 22:38:55 api_server.py:179] Started engine process with PID 84664
INFO 10-31 22:39:02 config.py:934] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 10-31 22:39:02 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 10-31 22:39:02 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-31 22:39:05 config.py:934] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
WARNING 10-31 22:39:05 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 10-31 22:39:05 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 10-31 22:39:05 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', speculative_config=None, tokenizer='neuralmagic/Mistral-Nemo-Instruct-2407-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Mistral-Nemo-Instruct-2407-FP8, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-31 22:39:06 selector.py:120] Using ROCmFlashAttention backend.
INFO 10-31 22:39:06 model_runner.py:1056] Starting to load model neuralmagic/Mistral-Nemo-Instruct-2407-FP8...
WARNING 10-31 22:39:06 registry.py:247] Model architecture 'MistralForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting VLLM_USE_TRITON_FLASH_ATTN=0
WARNING 10-31 22:39:06 fp8.py:47] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 10-31 22:39:07 selector.py:120] Using ROCmFlashAttention backend.
INFO 10-31 22:39:07 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.70s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:04<00:02, 2.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.67s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.50s/it]

INFO 10-31 22:39:16 model_runner.py:1067] Loading model weights took 12.9013 GB
INFO 10-31 22:39:16 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241031-223916.pkl...
INFO 10-31 22:39:16 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241031-223916.pkl.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/fish/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/worker/model_runner.py", line 1658, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/models/llama.py", line 558, in forward
model_output = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/models/llama.py", line 347, in forward
hidden_states, residual = layer(positions, hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/models/llama.py", line 259, in forward
hidden_states = self.self_attn(positions=positions,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/models/llama.py", line 186, in forward
qkv, _ = self.qkv_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/layers/linear.py", line 371, in forward
output_parallel = self.quant_method.apply(self, input_, bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/layers/quantization/fp8.py", line 272, in apply
return apply_fp8_linear(
^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 132, in apply_fp8_linear
output = torch._scaled_mm(qinput,
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/fish/vllm/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
^^^^
File "/home/fish/vllm/vllm/engine/multiprocessing/engine.py", line 78, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/engine/llm_engine.py", line 348, in init
self._initialize_kv_caches()
File "/home/fish/vllm/vllm/engine/llm_engine.py", line 483, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/fish/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/worker/model_runner.py", line 1305, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/fish/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241031-223916.pkl): torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
[rank0]:[W1031 22:39:17.058584233 ProcessGroupNCCL.cpp:1394] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
File "/home/fish/venv/bin/vllm", line 33, in
sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/home/fish/vllm/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/home/fish/venv/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/fish/venv/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/fish/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/fish/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start

Neural Magic org

Hi @aaaaaaaaaasdf, as reported in the logs, the minimum requirement for FP8 on AMD GPUs is MI300.
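For context, the same hardware gate can be reproduced outside vLLM. Below is a minimal sketch, assuming the torch >= 2.5 calling convention for the private torch._scaled_mm op (per-tensor scales required, out_dtype keyword); since the op is private, its signature may shift between nightlies:

```python
import torch

# Minimal FP8 GEMM through the same private op that apply_fp8_linear calls.
# Dimensions are multiples of 16 and the second operand is column-major,
# as the op expects.
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(64, 32, device="cuda").to(torch.float8_e4m3fn).t()
scale = torch.ones((), device="cuda")  # per-tensor scales in float32

try:
    out = torch._scaled_mm(a, b, scale_a=scale, scale_b=scale, out_dtype=torch.bfloat16)
    print("FP8 GEMM ran:", out.shape)
except RuntimeError as err:
    # On gfx1100 (RX 7900 XTX) this raises the same RuntimeError as in the log,
    # because hardware FP8 GEMM is only enabled for MI300-class (gfx94x) GPUs on ROCm.
    print("FP8 GEMM unsupported:", err)
```

On MI300-class hardware the matmul runs; on the RX 7900 XTX it raises the error above, so the practical option on that card is to serve a non-FP8 checkpoint such as the bf16 Mistral-Nemo-Instruct-2407 that already worked for you.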
