This is the Llama 3.3 70B Instruct model converted to the Distributed Llama format. The model is quantized to the Q40 format. Due to Hugging Face file-size limits, the model is split into 11 parts. Before use, you need to combine the parts into a single file.

To run this model, you need approximately 42 GB of RAM on a single device, or the same ~42 GB distributed across 2, 4, 8, or 16 devices connected in a cluster (more information on how to set this up can be found here; a sketch of a multi-device run follows the steps below).

🚀 How to Run?

  1. ⏬ Download the model. You have two options:
  • Download this repository and combine all parts into a single file using the cat command (see the sketch after the steps below).
  • Download the model using the launch.py script from the Distributed Llama repository: python launch.py llama3_3_70b_instruct_q40
  2. ⏬ Download the Distributed Llama repository.
  3. 🔨 Build Distributed Llama:
make dllama
  4. 🚀 Run Distributed Llama:
./dllama chat --model llama3_3_70b_instruct_q40.m --tokenizer dllama_tokenizer_llama_3_3.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64
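
If you chose the first download option in step 1, the parts can be concatenated with cat. A minimal sketch, assuming the part files in this repository share a common prefix ending in .part<N> (adjust the names to the files you actually downloaded); ls -v applies natural sort so that part10 and part11 are appended after part2 rather than before it:

cat $(ls -v llama3_3_70b_instruct_q40.m.part*) > llama3_3_70b_instruct_q40.m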
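
To run across multiple devices instead of one, start a worker process on every device except the root, then point the root at the workers. A minimal 2-device sketch, assuming the worker mode and the --workers option described in the Distributed Llama repository (the IP address, port, and thread counts below are placeholders to adjust for your hardware):

On the worker device:

./dllama worker --port 9998 --nthreads 4

On the root device:

./dllama chat --model llama3_3_70b_instruct_q40.m --tokenizer dllama_tokenizer_llama_3_3.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 4 --workers 10.0.0.2:9998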

🎩 License

You need to accept the Llama 3.3 license before downloading this model.

Llama 3.3 Community License
