---
license: llama3.3
tags:
- distributed-inference
- text-generation
---

This is the **Llama 3.3 70B Instruct** model converted to the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) format. The model is quantized to Q40 format. Due to Hugging Face limitations, the model is split into 11 parts. Before use, you need to combine the parts together (see the example at the bottom of this page).

To run this model, you need approximately 42 GB of RAM on a single device, or approximately 42 GB of RAM distributed across 2, 4, 8, or 16 devices connected in a cluster (more information on how to set this up can be found [here](https://github.com/b4rtaz/distributed-llama); a sketch of a two-device launch is also included at the bottom of this page).

## 🚀 How to Run?

1. ⏬ Download the model. You have two options:
   * Download this repository and combine all parts together by using the `cat` command (see the example at the bottom of this page).
   * Download the model by using the `launch.py` script from the Distributed Llama repository: `python launch.py llama3_3_70b_instruct_q40`
2. ⏬ Download the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) repository.
3. 🔨 Build Distributed Llama:
```
make dllama
```
4. 🚀 Run Distributed Llama:
```
./dllama chat --model llama3_3_70b_instruct_q40.m --tokenizer dllama_tokenizer_llama_3_3.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64
```

## 🎩 License

You need to accept the Llama 3.3 license before downloading this model.

[Llama 3.3 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE)
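
## 🧩 How to Combine the Parts?

If you download this repository directly, the split files can be joined with `cat`. A minimal sketch, assuming the parts carry zero-padded, shell-sortable suffixes (check the actual filenames in this repository and adjust the pattern accordingly):

```
# Concatenate all parts in glob order into a single model file.
# Zero-padded suffixes (e.g. .part01 ... .part11) are assumed so that
# the shell glob sorts them correctly; verify the names before running.
cat llama3_3_70b_instruct_q40.m.part* > llama3_3_70b_instruct_q40.m
```

If the suffixes are not zero-padded (e.g. `part1`, `part10`, `part11`, `part2`), the glob will sort them in the wrong order; in that case, list the parts explicitly in numeric order instead.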
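
## 🔗 How to Run on a Cluster?

The ~42 GB of RAM can be split across several devices by running one root node plus worker nodes. A minimal sketch of a two-device setup, assuming the `worker` command and `--workers` flag described in the Distributed Llama repository (the IP address and port below are placeholders):

```
# On the worker device: start a worker node listening for the root node.
./dllama worker --port 9998 --nthreads 64

# On the root device: run the chat command and point it at the worker.
# Replace 10.0.0.2 with the worker device's actual address.
./dllama chat --model llama3_3_70b_instruct_q40.m --tokenizer dllama_tokenizer_llama_3_3.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64 --workers 10.0.0.2:9998
```

For larger clusters, start one worker per additional device and pass all of them to `--workers`, keeping the total device count at 2, 4, 8, or 16 as noted above.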