Optimal settings

by hushpiper

Just wanted to share my findings that, at least on GPTQ, Airochronos (and I suspect all Airo-based Llama 2 merges) will perform extremely poorly on any settings that don't include a significant Top A component and a low-to-zero Top K, particularly as you get further into an RP and the pool of probable responses available to the model shrinks. The only preset I got good results on was TFS-with-Top-A, where I had to bump Top A from 0.2 up to 0.4. Another person (whose prompts are less restrictive than mine) got good results from Best Guess after lowering Top K, adding a little Top A, and then raising the temperature to compensate for the relatively low Top A.

However you get there: Top A or die, and go easy on the Top K. (Or just enable Mirostat on ExLlama, which also seems to work.)
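For reference, the shape of the settings I landed on looks roughly like this. This is only a sketch with generic parameter names, not any particular frontend's exact config keys, and the temperature and TFS values are placeholders rather than part of my recommendation:

```python
# Illustrative sampler settings for Airochronos, loosely following the
# "TFS-with-Top-A" preset described above. Field names are generic.
airochronos_sampler = {
    "temperature": 1.0,  # the Best Guess route raises this to offset a lower Top A
    "top_k": 0,          # 0 = disabled; keep this low or off entirely
    "top_a": 0.4,        # bumped up from the preset's 0.2
    "tfs": 0.95,         # placeholder; use whatever your TFS-with-Top-A preset ships with
}
```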

I suspect this is down to Llama 2 13B being underbaked in a way that reacts with the Airoboros dataset really strangely: it augments what the base model half-learned, but doesn't augment it enough or in the right ways to make up for its weaknesses, so while it makes the output more diverse, it also ends up creating this samey probability soup where no response is significantly more likely than any other. Top K cuts the candidate pool down to a fixed number of the most probable tokens, so when many candidates are nearly tied it narrows the model's possible responses too aggressively. Top A instead sets its cutoff relative to how dominant the single most likely token is, so when the distribution is flat it keeps a broad swathe of candidates and performs very well in this situation; higher temperature also flattens the distribution and raises the number of viable candidates, so it can help as well.
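To make that concrete, here's a minimal numpy sketch of the two filters as they're commonly defined (Top A's cutoff being a times the square of the most likely token's probability). It isn't anything Airochronos-specific, just a toy illustration of why a flat distribution survives Top A but gets chopped hard by Top K:

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[-k:]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_a_filter(probs: np.ndarray, a: float) -> np.ndarray:
    """Drop tokens below a * (max probability)**2, then renormalize.
    The cutoff scales with the peak of the distribution, so a flat
    distribution keeps almost every candidate."""
    cutoff = a * probs.max() ** 2
    mask = np.where(probs >= cutoff, probs, 0.0)
    return mask / mask.sum()

# A flat "probability soup" like the one described above:
flat = np.full(50, 1.0 / 50)
print((top_k_filter(flat, 10) > 0).sum())    # 10 candidates survive
print((top_a_filter(flat, 0.4) > 0).sum())   # all 50 survive (cutoff ~0.00016)

# A peaked distribution with one dominant token:
peaked = np.array([0.7] + [0.3 / 49] * 49)
print((top_k_filter(peaked, 10) > 0).sum())  # still exactly 10
print((top_a_filter(peaked, 0.4) > 0).sum()) # only the dominant token survives
```

With a = 0.4 the flat, samey distribution keeps every candidate while the peaked one collapses to its single dominant token; Top K keeps exactly 10 either way, which is the over-aggressive narrowing I'm describing.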

--That's my theory anyway. I think it'd be really useful for others to say how different settings have worked for their use cases. But in my tests and my discussions with others, raising Top A for this model has been a night and day difference in terms of how well it performs.
