Update README.md
README.md CHANGED
@@ -30,6 +30,16 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 * [ctransformers](https://github.com/marella/ctransformers), a Python library with LangChain support and OpenAI-compatible AI server.
 * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), a Python library with OpenAI-compatible API server.
 
+## Extended context
+
+This is an extended context base Llama 2 model. Please check if your GGML client supports extended context. llama.cpp and KoboldCpp do, but I have not verified the others.
+
+I believe the correct parameters for llama.cpp extended context are:
+```
+-c <contextsize> --rope-freq-base 10000 --rope-freq-scale 0.5
+```
+
+I have tested these parameters and the answer is coherent, but I haven't yet confirmed if they're ideal. Please let me know in Discussions if you have feedback on that.
 
 ## Repositories available
 
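To make the new section concrete, here is a minimal sketch of the README's example `./main` command combined with the extended-context flags added above. The `-c 8192` value is an assumption (Llama 2's 4096-token base context doubled by `--rope-freq-scale 0.5`), not a value the README specifies:

```
# Illustrative sketch only: the README's example command plus the
# extended-context flags. 8192 assumes --rope-freq-scale 0.5 doubles
# Llama 2's 4096-token base context.
./main -t 10 -ngl 32 -m llongma-2-7b.ggmlv3.q4_0.bin --color \
  -c 8192 --rope-freq-base 10000 --rope-freq-scale 0.5 \
  --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Llamas are very"
```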
@@ -37,7 +47,7 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/LLongMA-2-7B-GGML)
 * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/conceptofmind/LLongMA-2-7b)
 
-## Prompt template:
+## Prompt template: None
 
 ```
 {prompt}
@@ -97,7 +107,7 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -t 10 -ngl 32 -m llongma-2-7b.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "
+./main -t 10 -ngl 32 -m llongma-2-7b.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Llamas are very"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
 
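The README also lists [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) and its OpenAI-compatible API server as another way to run these GGML files. A rough sketch of that route follows; the package extra, the server flags, and the need for a GGML-era release of the library are assumptions rather than details from the README:

```
# Sketch only: serve the quantised GGML file through llama-cpp-python's
# OpenAI-compatible server. Assumes a llama-cpp-python release that still
# loads GGML (pre-GGUF) model files.
pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model llongma-2-7b.ggmlv3.q4_0.bin --n_ctx 2048
```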
@@ -177,3 +187,5 @@ If you have any questions about the data or model be sure to reach out and ask!
 The previous suite of LLongMA model releases can be found here: https://twitter.com/EnricoShippole/status/1677346578720256000?s=20
 
 All of the models can be found on Huggingface: https://huggingface.co/conceptofmind
+
+You can find the Llama-2 usage policy here: https://ai.meta.com/llama/use-policy/