This is the Llama 3.1 405B Instruct model converted to the Distributed Llama format and quantized to Q40. Due to Hugging Face file-size limits, the model is split into 56 parts; before use, you need to combine the parts into a single file.

To run this model, you need approximately 240 GB of RAM on a single device, or approximately 240 GB of RAM in total distributed across 2, 4, 8, or 16 devices connected in a cluster (you can find more information on how to do this here).
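As a rough sanity check of the ~240 GB figure, the weight footprint can be estimated from the parameter count. The 4.5 bits per weight used below is an assumption about the Q40 format (4-bit values plus a per-block scale); buffers and the KV cache add to this.

```python
# Back-of-envelope memory estimate for the quantized weights.
# 4.5 bits/weight for Q40 is an assumption (4-bit values + per-block scale).
params = 405e9            # 405B parameters
bits_per_weight = 4.5
weight_bytes = params * bits_per_weight / 8
print(f"{weight_bytes / 1e9:.0f} GB")  # ~228 GB for weights alone
```

The remaining headroom up to ~240 GB is consumed by activation buffers and the key-value cache at the configured sequence length.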

🚀 How to Run?

  1. ⏬ Download the model. You have two options:
  • Download this repository and combine all parts by using the `cat` command.
  • Download the model by using the `launch.py` script from the Distributed Llama repository: `python launch.py llama3_1_405b_instruct_q40`
  2. ⏬ Download the Distributed Llama repository.
  3. 🔨 Build Distributed Llama:
`make dllama`
  4. 🚀 Run Distributed Llama:
`./dllama chat --model dllama_model_llama31_405b_q40.m --tokenizer dllama_tokenizer_llama_3_1.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64`
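The part-combining step in option one can be sketched as follows. The file names below are dummies used to illustrate the pattern; substitute the actual part file names from this repository:

```shell
# Create two dummy parts to illustrate (replace with the real part files).
printf 'AAA' > model.m.part1
printf 'BBB' > model.m.part2

# Concatenate the parts, in order, into a single model file.
cat model.m.part1 model.m.part2 > model.m
```

With the real files this becomes something like `cat dllama_model_llama31_405b_q40.m.part* > dllama_model_llama31_405b_q40.m`, provided the shell glob lists the parts in the correct order.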

🎩 License

You need to accept the Llama 3.1 license before downloading this model.

Llama 3.1 Community License
