Update README.md

This repo contains GGML format model files for [Meta's Llama 2 70B Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat).

## Only compatible with latest llama.cpp

To use these files you need:

1. llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.
   - For users who don't want to compile from source, you can use the binaries from [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630).
2. To add the new command line parameter `-gqa 8`.

Example command:

```
/workspace/git/llama.cpp/main -m llama-2-70b-chat/ggml/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
```

There is no CUDA support at this time, but it should hopefully be coming soon.

There is no support in third-party UIs or Python libraries (llama-cpp-python, ctransformers) yet. That will come in due course.
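The `-p` argument in the example command above uses the Llama 2 chat prompt format. As an illustrative sketch (the variable names below are my own, not llama.cpp parameters), the prompt string can be assembled like so:

```shell
# Assemble the single-turn Llama 2 chat prompt used in the example command.
# SYSTEM and USER_MSG are illustrative names, not part of llama.cpp.
SYSTEM="You are a helpful assistant"
USER_MSG="Write a story about llamas"
PROMPT="[INST] <<SYS>>${SYSTEM}<</SYS>>${USER_MSG}[/INST]"
echo "$PROMPT"
```

Passing `"$PROMPT"` to `-p` reproduces the prompt string from the example command.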

## Repositories available

<!-- compatibility_ggml start -->
## Compatibility

### Only compatible with llama.cpp as of commit `e76d630`

Compatible with llama.cpp as of [commit `e76d630`](https://github.com/ggerganov/llama.cpp/commit/e76d630df17e235e6b9ef416c45996765d2e36fb) or later.

For a pre-compiled release, use [release master-e76d630](https://github.com/ggerganov/llama.cpp/releases/tag/master-e76d630) or later.
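If you are compiling from source, a minimal sketch of pinning llama.cpp to that commit might look like the following; the clone location and the plain `make` invocation are assumptions on my part, so adjust them for your platform:

```shell
# Fetch llama.cpp and pin it to the required commit before building.
# A plain `make` is assumed here; see the llama.cpp build docs for other options.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout e76d630df17e235e6b9ef416c45996765d2e36fb
make
```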

## Explanation of the new k-quant methods
<details>

I use the following command line; adjust for your tastes and needs:

```
./main -m llama-2-70b-chat/ggml/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
```

Change `-t 13` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.

No GPU support is possible yet, but it is coming soon.
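As a rough helper for picking the `-t` value, the snippet below estimates the physical core count; it is an illustration, not part of llama.cpp, and it assumes two hardware threads per physical core (SMT/hyperthreading), which is not true of every CPU:

```shell
# Estimate physical cores from the online logical CPU count,
# assuming 2 hardware threads per core, and clamp to at least 1.
LOGICAL=$(getconf _NPROCESSORS_ONLN)
THREADS=$(( LOGICAL / 2 ))
if [ "$THREADS" -lt 1 ]; then THREADS=1; fi
echo "Suggested flag: -t $THREADS"
```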

<!-- footer start -->
## Discord
**Patreon special mentions**: Slarti, Chadd, John Detwiler, Pieter, zynix, K, Mano Prime, ReadyPlayerEmma, Ai Maven, Leonard Tan, Edmond Seymore, Joseph William Delisle, Luke @flexchar, Fred von Graf, Viktor Bowallius, Rishabh Srivastava, Nikolai Manek, Matthew Berman, Johann-Peter Hartmann, ya boyyy, Greatston Gnanesh, Femi Adebogun, Talal Aujan, Jonathan Leane, terasurfer, David Flickinger, William Sang, Ajan Kanaga, Vadim, Artur Olbinski, Raven Klaugh, Michael Levine, Oscar Rangel, Randy H, Cory Kujawski, RoA, Dave, Alex, Alexandros Triantafyllidis, Fen Risland, Eugene Pentland, vamX, Elle, Nathan LeClaire, Khalefa Al-Ahmad, Rainer Wilmers, subjectnull, Junyu Yang, Daniel P. Andersen, SuperWojo, LangChain4j, Mandus, Kalila, Illia Dulskyi, Trenton Dambrowitz, Asp the Wyvern, Derek Yates, Jeffrey Morgan, Deep Realms, Imad Khwaja, Pyrater, Preetika Verma, biorpg, Gabriel Tamborski, Stephen Murray, Spiking Neurons AB, Iucharbius, Chris Smitley, Willem Michiel, Luke Pendergrass, Sebastain Graf, senxiiz, Will Dee, Space Cruiser, Karl Bernard, Clay Pascal, Lone Striker, transmissions 11, webtim, WelcomeToTheClub, Sam, theTransient, Pierre Kircher, chris gileta, John Villwock, Sean Connelly, Willian Hasse

Thank you to all my generous patrons and donaters!

<!-- footer end -->