	---
	base_model:
	- Qwen/Qwen3-Next-80B-A3B-Instruct
	license: apache-2.0
	pipeline_tag: text-generation
	---

	## Recent update:

	added IQ4_XS

	## Qwen3-Next-80B-A3B-Instruct ❤️ llama.cpp

	The qwen_next pull request (#16095) has been merged into the main branch and is included in llama.cpp release b7186.

	Homebrew has been updated, so you can simply run:

	```bash
	brew upgrade llama.cpp
	```
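To confirm that the installed build is new enough, one option is to compare the build number reported by `llama-cli --version` against 7186. The sketch below assumes the version line has the form `version: <build> (<commit>)`; the function name `build_at_least` is hypothetical and not part of llama.cpp.

```shell
# Hypothetical helper: succeed if the build number found in a
# `llama-cli --version` style string is at least the required build.
# Assumes a line of the form "version: <build> (<commit>)".
build_at_least() {
  required="$1"
  version_text="$2"
  build=$(printf '%s\n' "$version_text" \
    | sed -n 's/^version: \([0-9][0-9]*\).*/\1/p' \
    | head -n 1)
  [ -n "$build" ] && [ "$build" -ge "$required" ]
}

# Usage (2>&1 captures the version text whether it goes to stdout or stderr):
#   build_at_least 7186 "$(llama-cli --version 2>&1)" && echo "build is new enough"
```

If the check fails after `brew upgrade`, building from source as described next is the fallback.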

	You can also build from source:

	```bash
	git clone https://github.com/ggml-org/llama.cpp
	cd llama.cpp
	time cmake -B build
	time cmake --build build --config Release --parallel $(nproc --all)
	```
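Note that `nproc` comes from GNU coreutils and is not present on a stock macOS install. A minimal sketch of a portable job-count helper follows; the function name `cpu_count` is hypothetical and not part of llama.cpp.

```shell
# Hypothetical helper: print the number of logical CPUs on Linux or macOS.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc --all                    # GNU coreutils (most Linux distributions)
  elif n=$(sysctl -n hw.ncpu 2>/dev/null); then
    echo "$n"                      # macOS and the BSDs
  else
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1   # last-resort fallback
  fi
}

# Usage (assumes you are inside the llama.cpp checkout):
#   cmake -B build
#   cmake --build build --config Release --parallel "$(cpu_count)"
```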

	The speed in tokens per second is decent and should improve over time.

	For the Q4_0 quant:

	On a MacBook M4 Max:

	```
	prompt: 54 t/s gen: 11 t/s (CPU only, i.e. -ngl 0)
	prompt: 41 t/s gen: 7 t/s (GPU only, i.e. -ngl 99)
	```

	On an NVIDIA L40S (CUDA):

	```
	prompt: 127 t/s gen: 42 t/s GPU
	```
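Numbers like these can be collected with the `llama-bench` tool that is built alongside `llama-cli`. The following is a sketch and not necessarily how the figures above were measured; the wrapper name `bench_model`, the default binary path and the prompt and generation sizes are assumptions.

```shell
# Hypothetical wrapper around llama-bench.
#   $1: path to a GGUF file
#   $2: path to the llama-bench binary (default: build/bin/llama-bench)
bench_model() {
  model="$1"
  bench="${2:-build/bin/llama-bench}"
  if [ ! -x "$bench" ]; then
    echo "llama-bench not found at $bench" >&2
    return 127
  fi
  if [ ! -f "$model" ]; then
    echo "model file not found: $model" >&2
    return 1
  fi
  # -ngl 0,99 runs the benchmark twice: CPU only, then all layers on the GPU.
  # -p 512 measures prompt processing, -n 128 measures generation.
  "$bench" -m "$model" -ngl 0,99 -p 512 -n 128
}

# Usage:
#   bench_model Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf
```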

	## Recent update:

	added IQ4_NL, Q4_1, Q5_0

	added Q3_K_S, Q3_K_L, Q5_K_S

	## Update:
	I tested some of the smaller quants on an NVIDIA L40S GPU, using a default CUDA build
	of the excellent release from @cturan.

	Since the L40S has 48 GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4_MXFP4_MOE.

	Q4_K_M was too big to fit.
	It does work with -ngl 45,
	but it slowed down quite a bit.

	There may be a better way, but I did not have time to test.
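A quick sanity check before choosing `-ngl` is to compare the size of the GGUF file with the available VRAM. This is only a rough heuristic, sketched under the assumption that the weights dominate memory use; the KV cache and compute buffers need additional room, and the 4 GiB headroom default here is an arbitrary placeholder. The function name `fits_in_vram` is hypothetical.

```shell
# Hypothetical helper: succeed if a file is likely to fit in VRAM.
#   $1: path to the GGUF file
#   $2: VRAM in GiB (e.g. 48 for an L40S)
#   $3: headroom in GiB reserved for KV cache and buffers (default 4, a guess)
fits_in_vram() {
  file="$1"
  vram_gib="$2"
  headroom_gib="${3:-4}"
  [ -f "$file" ] || return 2
  # Arithmetic expansion also strips the padding some wc versions print.
  size_bytes=$(( $(wc -c < "$file") ))
  budget_bytes=$(( (vram_gib - headroom_gib) * 1024 * 1024 * 1024 ))
  [ "$size_bytes" -le "$budget_bytes" ]
}

# Usage:
#   if fits_in_vram model.gguf 48; then
#     echo "try -ngl 99"
#   else
#     echo "try a lower -ngl value, e.g. 45"
#   fi
```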

	I was able to get a good speed of 53 tokens per second for generation
	and 800 tokens per second for prompt processing.

	```bash
	wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
	tar xf test.tar.gz
	cd llama.cpp-test

	# export PATH=/usr/local/cuda/bin:$PATH

	time cmake -B build -DGGML_CUDA=ON
	time cmake --build build --config Release --parallel $(nproc --all)
	```

	You may need to add /usr/local/cuda/bin to your PATH
	so that nvcc (the NVIDIA CUDA compiler) can be found.
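The PATH adjustment can be made conditional so that it is only applied when `nvcc` is not already visible. A minimal sketch; the function name `ensure_nvcc` is hypothetical, and `/usr/local/cuda/bin` is the location mentioned above, which may differ on your system.

```shell
# Hypothetical helper: make nvcc reachable, prepending a CUDA bin dir if needed.
#   $1: CUDA bin directory (default: /usr/local/cuda/bin)
ensure_nvcc() {
  cuda_bin="${1:-/usr/local/cuda/bin}"
  if command -v nvcc >/dev/null 2>&1; then
    return 0                       # already on PATH, nothing to do
  fi
  if [ -x "$cuda_bin/nvcc" ]; then
    PATH="$cuda_bin:$PATH"
    export PATH
    return 0
  fi
  echo "nvcc not found; install the CUDA toolkit or pass its bin directory" >&2
  return 1
}

# Usage:
#   ensure_nvcc && cmake -B build -DGGML_CUDA=ON
```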

	Building from source took about 7 minutes.

	For more detail on CUDA build see:
	https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda


	## Quantized Models:

	These quantized models were generated using the excellent pull request from @pwilkin
	[#16095](https://github.com/ggml-org/llama.cpp/pull/16095)
	on 2025-10-19 with commit `2fdbf16eb`.

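	NOTE: when these files were generated, they only worked with the llama.cpp pull request #16095 branch, which was still in development. That pull request has since been merged (see the update near the top of this README).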
	Speed and quality should improve over time.

	### How to build and run on macOS

	```bash
	PR=16095
	git clone https://github.com/ggml-org/llama.cpp llama.cpp-PR-$PR
	cd llama.cpp-PR-$PR

	git fetch origin pull/$PR/head:pr-$PR
	git checkout pr-$PR

	time cmake -B build
	time cmake --build build --config Release --parallel $(sysctl -n hw.ncpu)
	```

	### Run examples

	Run with Hugging Face model:

	```bash
	build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF --prompt 'What is the capital of France?' --no-mmap -st
	```
	By default, this downloads lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M.

	To download a specific quant file directly:
	```bash
	wget https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF/resolve/main/Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf
	```
	or
	```bash
	pip install hf_transfer 'huggingface_hub[cli]'
	hf download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf
	```
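A failed download can leave an HTML error page or some other small file in place of the model, so a quick sanity check before running is useful. GGUF files start with the four ASCII bytes `GGUF`; the sketch below only checks that magic, so it will not detect a truncated file. The function name `is_gguf` is hypothetical.

```shell
# Hypothetical helper: succeed if the file starts with the GGUF magic bytes.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first four bytes; drop NULs so the shell does not warn about them.
  magic=$(head -c 4 "$1" | tr -d '\0')
  [ "$magic" = "GGUF" ]
}

# Usage:
#   is_gguf Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf || echo "not a GGUF file"
```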

	Run with local model file:

	```bash
	build/bin/llama-cli -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf --prompt 'Write a paragraph about quantum computing' --no-mmap -st
	```

	### Example prompt and output

	User prompt:

	Write a paragraph about quantum computing

	Assistant output:

	Quantum computing represents a revolutionary leap in computational power by harnessing the principles of quantum mechanics, such as superposition and entanglement, to process information in fundamentally new ways. Unlike classical computers, which use bits that are either 0 or 1, quantum computers use quantum bits, or qubits, which can exist in a combination of both states simultaneously. This allows quantum computers to explore vast solution spaces in parallel, making them potentially exponentially faster for certain problems—like factoring large numbers, optimizing complex systems, or simulating molecular structures for drug discovery. While still in its early stages, with challenges including qubit stability, error correction, and scalability, quantum computing holds transformative promise for fields ranging from cryptography to artificial intelligence. As researchers and tech companies invest heavily in hardware and algorithmic development, the race to achieve practical, fault-tolerant quantum machines is accelerating, heralding a new era in computing technology.

	[end of text]