b4rtaz/Llama-3_3-70B-Q40-Instruct-Distributed-Llama

	---
	license: llama3.3
	tags:
	- distributed-inference
	- text-generation
	---

	This is the Llama 3.3 70B Instruct model converted to the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) format and quantized to Q40. Due to Hugging Face limitations, the model is split into 11 parts, which you need to combine before use.

	To run this model, you need approximately 42 GB of RAM on a single device, or approximately 42 GB of RAM in total distributed across 2, 4, 8, or 16 devices connected in a cluster (see the [Distributed Llama repository](https://github.com/b4rtaz/distributed-llama) for setup instructions).
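
As a rough illustration of the memory figure above, the following sketch divides the approximate 42 GB total by each supported device count. It assumes the load is split evenly across devices, which is a simplification and not something this model card states; individual nodes may need more or less. The helper name `ram_per_device` is hypothetical.

```shell
# Rough per-device RAM estimate.
# ASSUMPTION: the ~42 GB total is divided evenly across all devices.
# Usage: ram_per_device <total-gb> <device-count>
ram_per_device() {
  total_gb=$1
  devices=$2
  awk -v t="$total_gb" -v n="$devices" 'BEGIN { printf "%.2f\n", t / n }'
}

# Print an estimate for each device count mentioned in this model card.
for n in 1 2 4 8 16; do
  printf '%2d device(s): ~%s GB each\n' "$n" "$(ram_per_device 42 "$n")"
done
```

Treat the output as a lower bound for planning, since it ignores any per-node overhead.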

	## 🚀 How to Run?

	1. ⏬ Download the model. You have two options:
	   * Download this repository and combine all parts into a single file with the `cat` command.
	   * Download the model with the `launch.py` script from the Distributed Llama repository: `python launch.py llama3_3_70b_instruct_q40`
	2. ⏬ Download the [Distributed Llama](https://github.com/b4rtaz/distributed-llama) repository.
	3. 🔨 Build Distributed Llama:
	```
	make dllama
	```
	4. 🚀 Run Distributed Llama:
	```
	./dllama chat --model llama3_3_70b_instruct_q40.m --tokenizer dllama_tokenizer_llama_3_3.t --buffer-float-type q80 --max-seq-len 2048 --nthreads 64
	```
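
If you choose the manual download in step 1, combining the parts with `cat` could look like the sketch below. The helper name `combine_parts` and the part prefix in the usage note are hypothetical; this model card does not list the part file names, so check the actual names in the repository's file list first. The sketch relies on the shell expanding the glob in lexicographic order, so it only works if the part suffixes sort in the right sequence (for example, zero-padded numbers).

```shell
# Concatenate split model parts into a single file.
# Usage: combine_parts <part-prefix> <output-file>
# NOTE: the output file name must not start with <part-prefix>,
# otherwise the glob would also match the output file itself.
combine_parts() {
  prefix=$1
  output=$2
  # Expand the glob into the positional parameters (lexicographic order).
  set -- "$prefix"*
  # If nothing matched, the pattern is left unexpanded and does not exist.
  if [ ! -e "$1" ]; then
    echo "no parts found for prefix: $prefix" >&2
    return 1
  fi
  cat "$@" > "$output"
}
```

For example, with a hypothetical prefix, `combine_parts model_part_ llama3_3_70b_instruct_q40.m` would produce the file name that the run command above expects.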

	## 🎩 License

	You need to accept the Llama 3.3 license before downloading this model.

	[Llama 3.3 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE)