license: llama2
language:
- en
tags:
- not-for-all-audiences
4.5 bpw (bits per weight) exl2 quantization of Venus-120b, using the measurement.json and dataset posted on the model page.
This size fits in 72GB of VRAM without using the FP8 cache. On 3x24GB GPUs, it uses about 69-70GB of VRAM with context loaded.
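As a rough sanity check on the VRAM figure, the weight footprint can be estimated from the parameter count and the average bits per weight. The sketch below assumes the approximately 122 billion parameters stated in the original model card and counts weights only; the context cache, activations, and framework overhead come on top. The helper name `estimate_weight_gb` is hypothetical and not part of exllama2, and the extra bpw values are the other quants mentioned in the original card.

```python
def estimate_weight_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    total_bits = n_params * bpw
    return total_bits / 8 / 1e9


# Assumption: ~122e9 parameters, as stated in the original model card.
N_PARAMS = 122e9

for bpw in (3.0, 4.5, 4.85):
    size_gb = estimate_weight_gb(N_PARAMS, bpw)
    print(f"{bpw} bpw: ~{size_gb:.1f} GB of weights")
```

At 4.5 bpw this works out to about 68.6 GB for the weights alone. Since bpw is an average and the parameter count is approximate, treat the result as a ballpark estimate rather than a guarantee that a given context length will fit.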
Original model card
Venus 120b - version 1.0
Overview
The goal was to create a large model that's highly capable for RP/ERP scenarios. Goliath-120b is excellent for roleplay, and Venus-120b was created with the idea of attempting to mix more than two models together to see how well this method works.
Model Details
- A result of interleaving layers of Sao10K/Euryale-1.3-L2-70B, NousResearch/Nous-Hermes-Llama2-70b, and migtissera/SynthIA-70B-v1.5 using mergekit.
- The resulting model has 140 layers and approximately 122 billion parameters.
- See mergekit-config.yml for details on the merge method used.
- See the exl2-* branches for exllama2 quantizations. The 4.85 bpw quant should fit in 80GB VRAM, and the 3.0 bpw quant should (just barely) fit in 48GB VRAM with 4k context.
- Inspired by Goliath-120b
Warning: This model will produce NSFW content!
Results
Initial tests show that Venus-120b functions fine. Overall, it seems comparable to Goliath-120b. Some differences I noticed:
- Venus needs lower temperature settings than Goliath. I recommend a temp of around 0.7, and no higher than 1.0.
- Venus tends to produce longer responses than Goliath on average, probably due to the inclusion of SynthIA in the merge, which is trained to produce long chain-of-thought responses.
- Venus seems a bit less creative than Goliath in the prose it generates, probably due to the lack of Xwin and the inclusion of Nous-Hermes.
Keep in mind this is all anecdotal from some basic tests. The key takeaway is that Venus shows that Goliath is not a fluke.
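To illustrate what the temperature recommendation above changes, here is a minimal sketch of standard temperature scaling: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution toward the most likely tokens. The logits are made up for illustration, and the function name `softmax_with_temperature` is hypothetical; individual samplers and front ends may apply temperature alongside other settings.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens (illustrative only).
logits = [2.0, 1.0, 0.5, -1.0]

for temp in (0.7, 1.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At 0.7 the top token receives a larger share of the probability mass than at 1.0, and the least likely token receives less, which is why lower temperatures tend to give more focused, less erratic output.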