b4rtaz/Llama-3_1-405B-Q40-Instruct-Distributed-Llama

b4rtaz committed on Jul 31, 2024

Commit 5672a7d • 1 Parent(s): 9b010dd

Update README.md

Files changed (1)

README.md +31 -3

@@ -1,3 +1,31 @@
----
-license: llama3.1
----

+---
+license: llama3.1
+tags:
+- distributed-inference
+- text-generation
+---
+This is the **Llama 3.1 405B Instruct** model converted to the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) format. The model is quantized to the Q40 format. Due to Hugging Face limitations, the model is split into 56 parts, which must be combined into a single file before use.
+To run this model, you need approximately 240 GB of RAM on a single device, or approximately 240 GB in total distributed across 2, 4, 8, or 16 devices connected in a cluster (for setup instructions, see the [Distributed Llama repository](https://github.com/b4rtaz/distributed-llama)).
+## 🚀 How to Run?
+1. ⏬ Download the model. You have two options:
+  * Download this repository and combine all parts into a single file with the `cat` command.
+  * Download the model with the `launch.py` script from the Distributed Llama repository: `python launch.py llama3_1_405b_instruct_q40`
+2. ⏬ Download the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) repository.
+3. 🔨 Build Distributed Llama:
+```
+make dllama
+```
+4. 🚀 Run Distributed Llama:
+```
+./dllama chat --model dllama_model_llama31_405b_q40.m --tokenizer dllama_tokenizer_llama_3_1.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64
+```
+## 🎩 License
+You need to accept the Llama 3.1 license before downloading this model.
+[Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
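
The first download option in step 1 (combining the parts with `cat`) could look roughly like the sketch below. The README does not give the part file names, so the prefix passed to the hypothetical `combine_parts` helper is an assumption; check the repository's file list for the real names. The sketch also assumes the part suffixes sort lexicographically in part order (for example `aa`, `ab`, ... or zero-padded numbers), because the shell expands the glob in that order.

```shell
#!/bin/sh
# Sketch only: join downloaded model parts into a single file.
# combine_parts is a hypothetical helper, not part of Distributed Llama.
combine_parts() {
  prefix="$1"   # assumed common prefix shared by all part files
  output="$2"   # name of the combined model file
  # The glob expands in lexicographic order, so the suffixes must sort
  # in part order. The output name must not itself match "$prefix"*.
  cat "$prefix"* > "$output"
}
```

For example, if the parts were named `dllama_model_llama31_405b_q40_aa`, `..._ab`, and so on (an assumed naming scheme), `combine_parts dllama_model_llama31_405b_q40_ dllama_model_llama31_405b_q40.m` would produce the file name used in the run command above. Comparing the combined file's size with the sum of the part sizes is a cheap sanity check before loading it.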