Hardware Requirements

#86
by ShivanshMathur007 - opened

What are the exact hardware requirements to run mistralai/Mixtral-8x7B-Instruct-v0.1 locally on a machine or VM? Storage, RAM, GPU, cache/buffer, etc. Please advise.


It all depends on how fast you want it to go.

For example, when running inference on the locally downloaded model it should reach about 5 tokens/sec. If you can provide the requirements for other speeds as well, that would be helpful.

Please also share the minimum requirements; that would be helpful. The Mixtral website says:
"Mixtral requires 64GB of RAM and 2 GPUs, which increases the cost by a factor of 3 ($1.3/h vs. $4.5/h)." Can anyone elaborate on this?

You can run it with 8-bit precision on one A100 (80GB), which costs ~$1.89/h on Runpod.
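For reference, here is a minimal sketch of 8-bit loading with transformers + bitsandbytes (the prompt and generation settings are just illustrative, and actual VRAM use will vary with context length):

```python
# Minimal sketch: 8-bit inference for Mixtral-8x7B-Instruct on a single ~80GB GPU.
# Requires transformers, accelerate and bitsandbytes; settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # spread layers over the available GPU(s)
)

prompt = "[INST] How much VRAM does Mixtral 8x7B need? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```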

I failed to run it on an A100 (40GB):

```
INFO 07-30 01:44:07 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 122, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 148, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 261, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 508, in __init__
[rank0]:     self.model = MixtralModel(config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 449, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 450, in <listcomp>
[rank0]:     MixtralDecoderLayer(config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 388, in __init__
[rank0]:     self.block_sparse_moe = MixtralMoE(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 103, in __init__
[rank0]:     self.w13_weight = nn.Parameter(torch.empty(self.num_total_experts,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU
```
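That OOM is expected: the bf16 weights alone are roughly 93GB, so they cannot fit on a single 40GB card regardless of engine settings. A hedged sketch of two workarounds with vLLM's Python API follows; the GPU count and the AWQ checkpoint name are assumptions for illustration, not something verified in this thread:

```python
# Sketch only: two ways to fit Mixtral-8x7B into vLLM when one 40GB GPU is not enough.
from vllm import LLM, SamplingParams

# Option 1: shard the bf16 weights across several GPUs with tensor parallelism
# (e.g. 4 x A100-40GB; the GPU count here is an assumption).
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=4,
    max_model_len=8192,           # shrink the context to leave room for the KV cache
    gpu_memory_utilization=0.90,
)

# Option 2 (alternative): load a pre-quantized checkpoint, e.g. a community AWQ build,
# which brings the weights down to roughly 25GB.
# llm = LLM(
#     model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # example community quant
#     quantization="awq",
#     max_model_len=8192,
# )

out = llm.generate(["[INST] Hello [/INST]"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```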

Mistral AI_ org

Hi there, Mixtral 8x7B requires around 100GB of VRAM for full-precision inference; to run on less, you will have to use quantization and run at lower precision.
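A rough back-of-envelope on where that ~100GB comes from, using Mixtral's published total of ~46.7B parameters (overheads are approximate):

```python
# Rough VRAM estimate from Mixtral's ~46.7B total parameters; KV cache and
# activations add on top of the weights, so treat these as lower bounds.
total_params = 46.7e9
for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{total_params * bytes_per_param / 1e9:.0f} GB of weights")
# bf16/fp16: ~93 GB -> ~100 GB with KV cache, hence 2 GPUs (or one 80GB card is not enough) unquantized
# int8:      ~47 GB -> fits a single 80GB A100, as noted above
# int4:      ~23 GB -> fits a 24-48 GB consumer/workstation GPU
```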
