license: llama3.1
base_model: ArliAI/Llama-3.1-8B-ArliAI-RPMax-v1.3
tags:
- llama-cpp
- gguf-my-repo
Triangle104/Llama-3.1-8B-ArliAI-RPMax-v1.3-Q4_K_S-GGUF
This model was converted to GGUF format from ArliAI/Llama-3.1-8B-ArliAI-RPMax-v1.3
using llama.cpp via ggml.ai's GGUF-my-repo space.
Refer to the original model card for more details on the model.
Model details:
RPMax is a series of models trained on a diverse set of curated creative writing and RP datasets, with a focus on variety and deduplication. The model is designed to be highly creative and non-repetitive: no two entries in the dataset share a character or situation, so the model does not latch on to a single personality and can understand and act appropriately for any character or situation.
Many RPMax users have mentioned that these models do not feel like other RP models: they have a different writing style and generally do not feel in-bred.
You can access the model at https://arliai.com and we also have a models ranking page at https://www.arliai.com/models-ranking
Ask questions in our new Discord Server https://discord.com/invite/t75KbPgwhk or on our subreddit https://www.reddit.com/r/ArliAI/
Model Description
Llama-3.1-8B-ArliAI-RPMax-v1.3 is a variant made from the Llama-3.1-8B-Instruct model.
Let us know what you think of the model! The different parameter versions are based on different models, so they might all behave slightly differently in their own way.
v1.3 models are trained with updated software and configs, including an updated transformers library that fixes the gradient checkpointing bug, which should help the model learn better. This version also uses rsLoRA+ for training, which helps the model learn even better.
Specs
Context Length: 128K
Parameters: 8B
Training Details
Sequence Length: 8192
Training Duration: Approximately 10 hours on 2x3090Ti
Epochs: 1 epoch training for minimized repetition sickness
RS-QLORA+: 64-rank 64-alpha, resulting in ~2% trainable weights
Learning Rate: 0.00001
Gradient accumulation: Very low 32 for better learning.
Suggested Prompt Format
Meta Llama 3 Instruct Format
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are [character]. You have a personality of [personality description]. [Describe scenario]<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
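The template above can be assembled programmatically. The following is a minimal sketch, not part of the original card: the function name `build_llama3_prompt` and its arguments are hypothetical, and the blank line after each header follows Meta's published Llama 3 Instruct format. Some runtimes add `<|begin_of_text|>` on their own, so check yours before sending the string as-is.

```python
def build_llama3_prompt(system, turns):
    """Render a system prompt and chat turns in Llama 3 Instruct format.

    `turns` is a list of (role, text) pairs with role "user" or "assistant".
    The result ends with an open assistant header so the model writes the next reply.
    """
    def block(role, text):
        # One complete message: header, blank line, content, end-of-turn token.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"

    prompt = "<|begin_of_text|>" + block("system", system)
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        prompt += block(role, text)
    # Leave the assistant turn open for generation.
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"


prompt = build_llama3_prompt(
    "You are Mira. You have a personality of a dry-witted innkeeper. A storm traps travelers at the inn.",
    [("user", "I shake the rain off my cloak and ask for a room.")],
)
print(prompt)
```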
RPMax: Reduced repetition and higher creativity model
The goal of RPMax is to reduce repetition and increase the model's ability to write creatively in the different situations presented to it. In practice, this means the model produces widely varied responses without falling into predictable tropes, even across different situations.
What is repetition and creativity?
First of all, creativity should mean the variety of output that the model is capable of creating. You should not confuse creativity with prose style. When a model writes in a way that can be called pleasant, like a novelist would, this is not in itself creative writing. It is just a model having a certain pleasant prose style. So a model that writes nicely is not necessarily a creative model.
Repetition and creativity are essentially intertwined: if a model is repetitive, it can also be called un-creative, since it cannot write new things and can only repeat responses similar to ones it has produced before. There are actually two very different forms of repetition.
In-context repetition: When people mention a model is repetitive, this usually means a model that likes to repeat the same phrases within a single conversation. An example is a model that says a character "flicks her hair and...." and then starts to prepend "flicks her hair and..." to every other action that character does.
It can be said that such a model is boring, but even in real people's writing this kind of repetition can be intentional, to subtly make a point or showcase a character's traits in some scenarios. So this type of repetition is not always bad, and completely discouraging a model from doing it does not always improve the model's writing ability.
Cross-context repetition: A second, arguably worse, type of repetition is a model's tendency to repeat the same phrases or tropes in very different situations. An example is a model that likes to repeat the infamous "shivers down my spine" phrase in wildly different conversations where that phrase does not necessarily fit.
This type of repetition is ALWAYS bad, as it is a sign that the model has over-fitted to the style of "creative writing" that it has often seen in the training dataset. A model's tendency toward cross-context repetition is also usually visible in how it chooses similar, repetitive names when writing stories, such as the infamous "elara" and "whispering woods" names.
With RPMax v1 the main goal is to create a highly creative model by reducing cross-context repetition, as that is the type of repetition that follows you through different conversations. It is also a type of repetition that can be combated by making sure your dataset does not repeat the same situations or characters across different example entries.
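One rough way to spot cross-context repetition in practice is to look for word sequences that recur across unrelated conversations. The sketch below is only an illustration under assumptions: the function name `cross_context_ngrams`, the 4-word window, and the sample texts are made up here and are not how RPMax was evaluated.

```python
import re
from collections import defaultdict


def cross_context_ngrams(conversations, n=4, min_conversations=2):
    """Return word n-grams that occur in at least `min_conversations` different texts."""
    seen_in = defaultdict(set)
    for idx, text in enumerate(conversations):
        # Lowercase and drop punctuation so "spine," and "spine" match.
        words = re.findall(r"[a-z']+", text.lower())
        for i in range(len(words) - n + 1):
            seen_in[" ".join(words[i:i + n])].add(idx)
    return {gram: len(ids) for gram, ids in seen_in.items() if len(ids) >= min_conversations}


# Made-up outputs from three unrelated conversations.
outputs = [
    "A cold wind blows and sends shivers down my spine as I open the door.",
    "Your business proposal sends shivers down my spine, honestly.",
    "The cat sleeps on the warm windowsill all afternoon.",
]
repeated = cross_context_ngrams(outputs, n=4)
print(repeated)
```

A phrase that shows up in both a horror scene and a business chat is a candidate for the kind of over-fitted trope described above.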
Dataset Curation
RPMax is successful thanks to the training method and the training dataset created for fine-tuning these models. The dataset draws on as many open source creative writing and RP datasets as could be found (mostly from Hugging Face), which were then curated to weed out datasets that are purely synthetic generations, as these often only serve to dumb down the model and make it learn GPT-isms (slop) rather than help.
Llama 3.1 8B is then used to create a database of the characters and situations portrayed in these datasets, which is in turn used to de-dupe the datasets so that there is only a single entry for any character or situation.
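The de-duplication step described above can be sketched as follows. This is a simplified illustration, not the actual RPMax pipeline: it assumes each entry already carries hypothetical `character` and `situation` labels (standing in for whatever the labeling model extracted) and compares them by exact match after normalization.

```python
def dedupe_by_character_and_situation(entries):
    """Keep the first entry for each character and each situation; drop later repeats."""
    seen_characters, seen_situations = set(), set()
    kept = []
    for entry in entries:
        character = entry["character"].strip().lower()
        situation = entry["situation"].strip().lower()
        # Skip the entry if either its character or its situation was already kept.
        if character in seen_characters or situation in seen_situations:
            continue
        seen_characters.add(character)
        seen_situations.add(situation)
        kept.append(entry)
    return kept


# Made-up entries for illustration only.
entries = [
    {"id": 1, "character": "Elara", "situation": "lost in a forest"},
    {"id": 2, "character": "elara", "situation": "tavern brawl"},
    {"id": 3, "character": "Captain Voss", "situation": "lost in a forest"},
    {"id": 4, "character": "Captain Voss", "situation": "space mutiny"},
]
kept = dedupe_by_character_and_situation(entries)
print([e["id"] for e in kept])  # [1, 4]
```

Entry 3 is dropped for its repeated situation, so "Captain Voss" is first recorded with entry 4. A real pipeline would likely need fuzzier matching than exact string comparison.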
The Golden Rule of Fine-Tuning
Unlike the initial pre-training stage, where for the most part the more data you throw at the model the better it becomes, the golden rule for fine-tuning is quality over quantity. So the dataset for RPMax is actually orders of magnitude smaller than it would be if it included repeated characters and situations, but the end result is a model that does not feel like just another remix of any other creative writing/RP model.
Training Parameters
RPMax's training parameters also take a different approach from other fine-tunes. The usual way is to use a low learning rate and high gradient accumulation for better loss stability, and then run multiple epochs until the loss is acceptable.
RPMax's Unconventional Approach
RPMax, on the other hand, is trained for only a single epoch, with low gradient accumulation and a higher than normal learning rate. The loss curve during training is actually unstable and jumps up and down a lot, but if you smooth it out, it is still steadily decreasing over time, although it never ends up at a very low loss value. The theory is that this lets the model learn much more from each individual example in the dataset, and that not showing the model the same example twice stops it from latching on to and reinforcing a single character or story trope.
The loss jumps up and down during training because each new entry from the dataset is unlike anything the model has seen before, so it cannot really predict an answer similar to the example entry. The relatively high final loss of 1.0 or slightly above for RPMax models is actually good, because the goal was never to create a model that outputs exactly like the training dataset, but rather to create a model that is creative enough to make up its own style of responses.
This is different from training a model in a particular domain and needing the model to reliably be able to output like the example dataset, such as when training a model on a company's internal knowledge base.
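The smoothing mentioned above, used to see the trend in a jumpy loss curve, can be done with an exponential moving average. This is a minimal sketch under assumptions: the function name `ema_smooth`, the `beta` value, and the loss numbers are invented for illustration and are not RPMax training logs.

```python
def ema_smooth(values, beta=0.9):
    """Exponential moving average with bias correction for the early steps."""
    smoothed, avg = [], 0.0
    for step, value in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * value
        # Dividing by (1 - beta**step) removes the bias toward the 0.0 start value.
        smoothed.append(avg / (1 - beta ** step))
    return smoothed


# Made-up loss values that jump around while trending downward.
raw_loss = [2.4, 1.6, 2.2, 1.4, 2.0, 1.3, 1.9, 1.2, 1.7, 1.1]
smoothed = ema_smooth(raw_loss)
for step, (raw, smooth) in enumerate(zip(raw_loss, smoothed), start=1):
    print(f"step {step:2d}  raw={raw:.2f}  smoothed={smooth:.2f}")
```

A higher `beta` gives a flatter curve that reacts more slowly to new values.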
Difference between versions?
v1.0 had some mistakes in the training parameters, hence why not many versions of it were created.
v1.1 fixed the previous errors and is the version where many different base models were used in order to compare them and figure out which are most ideal for RPMax. The consensus is that Mistral-based models were fantastic for RPMax, as they are by far the most uncensored by default. Gemma also seems to have quite an interesting writing style, but it had a lot of issues with running and training, and there was generally low interest in it. Llama 3.1 based models also seem to do well, with 70B having the lowest loss at the end of the training runs.
v1.2 was a fix of the dataset, after it was found that many entries contained broken or otherwise nonsensical system prompts or messages in the example conversations. Training the models on the v1.2 dataset predictably made them better at following instructions and staying coherent.
v1.3 was not meant to be created, but with the gradient checkpointing bug recently found and training frameworks finally updated with the fix, it seemed like a good excuse to run a v1.3 of RPMax. This version focuses on improving the training parameters: this time training was done using rsLoRA+, that is, rank-stabilized low-rank adaptation with the addition of LoRA+. These additions improved the models' learning quite considerably, with all models achieving lower loss than the previous iteration and producing better-quality outputs in real usage.
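For context on the rank-stabilized part: standard LoRA scales the adapter update by alpha / r, while rsLoRA scales it by alpha / sqrt(r). The short sketch below only works through that arithmetic for the rank 64, alpha 64 setting listed under Training Details. The function names are hypothetical, and whether the RPMax training code applies the scaling exactly this way is an assumption.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA scaling factor: divide by sqrt(rank) instead of rank."""
    return alpha / math.sqrt(rank)


print(lora_scale(64, 64))    # 1.0
print(rslora_scale(64, 64))  # 8.0
```

With the same alpha, the rank-stabilized factor keeps the adapter update from shrinking as the rank grows.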
Real Success?
RPMax models have been out for a few months at this point, with versions from v1.0 all the way to the new v1.3. So far RPMax seems to have been a resounding success in that it achieves its original goal of being a new creative writing/RP model that does not write like other RP finetunes. Many users have mentioned that it almost feels like interacting with a real person in an RP scenario, and that it does impressively unexpected things in their stories that catch them off guard in a good way.
Is it the best model there is? Probably not, but there is never one single best model. So try it out for yourself and maybe you will like it! As always, any feedback on the model is appreciated and will be taken into account for the next versions.
Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
brew install llama.cpp
Invoke the llama.cpp server or the CLI.
CLI:
llama-cli --hf-repo Triangle104/Llama-3.1-8B-ArliAI-RPMax-v1.3-Q4_K_S-GGUF --hf-file llama-3.1-8b-arliai-rpmax-v1.3-q4_k_s.gguf -p "The meaning to life and the universe is"
Server:
llama-server --hf-repo Triangle104/Llama-3.1-8B-ArliAI-RPMax-v1.3-Q4_K_S-GGUF --hf-file llama-3.1-8b-arliai-rpmax-v1.3-q4_k_s.gguf -c 2048
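Once the server is running, it can be queried over HTTP. The sketch below assumes llama-server's default address of 127.0.0.1:8080 and its OpenAI-compatible `/v1/chat/completions` endpoint. The helper names `build_chat_request` and `chat` and the sample messages are made up for illustration, so adjust the URL if you started the server with a different host or port.

```python
import json
import urllib.request


def build_chat_request(system, user_message, base_url="http://127.0.0.1:8080", max_tokens=256):
    """Build a POST request for the server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(system, user_message, **kwargs):
    """Send one chat turn to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(system, user_message, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server started with the command above):
# print(chat("You are Mira, a dry-witted innkeeper.", "I ask for a room."))
```

With this endpoint the server applies the chat template stored in the GGUF file, so the prompt format does not need to be assembled by hand.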
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).
cd llama.cpp && LLAMA_CURL=1 make
Step 3: Run inference through the main binary.
./llama-cli --hf-repo Triangle104/Llama-3.1-8B-ArliAI-RPMax-v1.3-Q4_K_S-GGUF --hf-file llama-3.1-8b-arliai-rpmax-v1.3-q4_k_s.gguf -p "The meaning to life and the universe is"
or
./llama-server --hf-repo Triangle104/Llama-3.1-8B-ArliAI-RPMax-v1.3-Q4_K_S-GGUF --hf-file llama-3.1-8b-arliai-rpmax-v1.3-q4_k_s.gguf -c 2048