Question about remerge procedure

#1
by TeeZee - opened

Hi @DavidAU , first of all, thank you for choosing my model :). I would like to incorporate the ultra quality step into my future merges. So, if I'm understanding it correctly, you took the mergekit configuration used for DarkSapling and changed the precision (dtype) to float32?

Hey:

Thanks for making such a great model. Dark Forest 1 and 2 also received the Ultra treatment.

RE: Merges -> All core files are downloaded, then merge(s) redone in float32.
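
A minimal sketch of the "redo the merge in float32" step, assuming the original merge was described by a mergekit-style config with a top-level `dtype` field. The model names, merge method, weights and the helper name `with_float32` are hypothetical placeholders, not the actual DarkSapling recipe:

```python
import copy

def with_float32(config: dict) -> dict:
    """Return a copy of a mergekit-style config with dtype forced to float32."""
    new_config = copy.deepcopy(config)  # leave the original recipe untouched
    new_config["dtype"] = "float32"
    return new_config

# Hypothetical recipe for illustration only (not the real DarkSapling config).
original = {
    "merge_method": "linear",
    "models": [
        {"model": "example-org/model-a", "parameters": {"weight": 0.5}},
        {"model": "example-org/model-b", "parameters": {"weight": 0.5}},
    ],
    "dtype": "bfloat16",
}

remerge = with_float32(original)
print(remerge["dtype"])  # float32
```

Everything else in the recipe stays the same; only the precision of the merge math changes.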

Perplexity and real world testing are done.
This is a check and balance in case of error(s) along the way in the previous processes.
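
For reference, perplexity in its usual definition is the exponential of the average negative log-likelihood per token. The sketch below assumes you already have per-token log-probabilities from somewhere; the input values are made up for illustration:

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a sequence of token log-probs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that always assigns probability 0.25 to the correct token
# is "as confused as" a uniform choice between 4 options.
example = [math.log(0.25)] * 8
print(perplexity(example))
```

Lower is better, which is why the test data later in this thread treats a drop in PPL as an improvement.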

If the model has component(s) that are also merge(s), the same process is done as far back as possible.
This complicates the first step(s).

With mergekit there is also the issue of CPU vs GPU to consider, as CPU math is a little more accurate.
It is also unclear whether GPU math differs (enough) between Nvidia and AMD.
Server-based "math" vs "local machine" math is another possible issue.
(I do everything locally).
Likewise, make sure Python and your tensor library versions are up to date.
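
One reason results can differ between backends is that floating-point addition is not associative, so the order in which values are accumulated changes the answer. The sketch below only illustrates that effect by simulating float32 rounding with the standard library; it does not measure any real CPU or GPU, and the helper names are made up for this example:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

values = [1.0e8, 1.0, -1.0e8, 1.0]  # exact sum is 2.0

forward = sum_f32(values)            # 1.0: the first +1.0 is lost next to 1e8
reordered = sum_f32(sorted(values))  # 0.0: both +1.0 terms are lost

print(forward, reordered)
```

Same numbers, same precision, different order, different result. A parallel reduction on a GPU and a sequential loop on a CPU will generally not add things in the same order.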

Then the GGML file is set at F32 -> GGUFs are made.
(20B = 75GB)
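
The size figure follows from simple arithmetic: 4 bytes per weight at F32. A quick check, assuming exactly 20 billion parameters, no file metadata, and "GB" read as GiB (2^30 bytes):

```python
def model_size_gib(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the raw weights in GiB (metadata ignored)."""
    return n_params * bytes_per_param / 2**30

f32_size = model_size_gib(20e9)                     # about 74.5 GiB
f16_size = model_size_gib(20e9, bytes_per_param=2)  # about 37.3 GiB

print(f"F32: {f32_size:.1f} GiB, F16: {f16_size:.1f} GiB")
```

So an F32 source file is roughly twice the size of the usual F16 one, which is worth planning disk space for.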

NOTE: Imatrix issues are a problem. The imatrix system is designed to "prune deadwood", but it is now actually pruning the more accurate parameters/weights.
Great care must be taken in choosing the correct imatrix dataset file(s).

General datasets like "wiki" are TOO STRONG and degrade the model's performance.

Likewise, corrupt dataset(s) are also an issue, mainly to do with formatting: this confuses the model and leads to repeat issues and/or crashes in the worst cases.
As of this date I have tested over 50 datasets (including ones I created, as well as "corrupt" ones) to work out finer tuning of the model via Imatrix.
Or better put: reduce / control Imatrix and balance it better.

This is the next step, called Imatrix Plus 2 and Imatrix X.

A test repo is here:
https://huggingface.co/DavidAU/DarkSapling-V2-Ultra-Quality-7B-GGUF-imatPLUS2-imatX

I used 7B because it is faster, after running spot tests on 20B models.

Awesome explanation, thanks! Do you have some preferred dataset for Imatrix calculation for RP/ERP models?

I will be posting some shortly.
Here are some key issues:

1 - DO NOT use any dataset with "hard returns",
i.e. PDF -> BOOK -> TEXT, where a return is used to force formatting instead of "word wrap".
The hard returns must be removed.
Otherwise the results are basically ZERO or worse... a crash.
Hard returns can be removed in OpenOffice / LibreOffice using find/replace with regular expressions and "$" / "QQQQ"; this is a straightforward, but not simple, process.

2 - Format of the data: it should have clear and consistent separators. Long unwrapped lines are no issue; weird-assed formatting is.
Do a quick scroll after "repair" to fix any odd issues.
If it catches your eye, it will confuse the LLM.

3 - For a tonne of text files (most need "hard returns fixed") - including "rp" - see:
http://www.textfiles.com/directory.html
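
The hard-return cleanup from point 1 can also be scripted. This is a Python alternative to the LibreOffice find/replace route, not the same procedure, and it assumes blank lines mark the real paragraph breaks while single newlines are forced wraps. It does not rejoin words hyphenated across lines, so a quick scroll afterwards is still needed:

```python
import re

def remove_hard_returns(text: str) -> str:
    """Join hard-wrapped lines into paragraphs, keeping blank-line breaks."""
    # Normalize Windows / old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blank (or whitespace-only) lines are treated as paragraph separators.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n")]
        joined = " ".join(line for line in lines if line)
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)

sample = (
    "The quick brown\n"
    "fox jumps over\n"
    "the lazy dog.\n"
    "\n"
    "Second paragraph\n"
    "starts here.\n"
)
print(remove_hard_returns(sample))
```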

I have found (so far) that stories work very well (multiple in a text file, separated by title). Look at 500k to 1 MB in size, which gives you 400-600 chunks.
However there is an entire "rp" section ... so lots to choose from.

In my experiments short stories also worked well, thank you. I'll have plenty to do now ;) (datasets, remerges of DF 3.0 before releasing it, etc.). Really great insights. @Nexesenex - this thread might be of interest to you also.

Thanks!

RE: Datasets:
-> When using multiple datasets , application order is also critical.
-> Sometimes using a dataset that DOES NOT match the model makes sense if you want minimal changes, yet still have imatrix benefits.

I ran experiments and measured results, both perplexity and real world, and the differences are there (in all cases).
Likewise, the dataset/imatrix effect "maxes out" around Q4_K_M / Q5_K_S, with limited effect on Q5_K_M and up.

Ultra Quality -> Q4_K_M operates at or close to Q6 performance; Q5 and Q6 are above this.
Q8 seems "bland" for some reason in comparison to Q6.
Likewise, the lower IQ/Q quants ramp up in performance and are disproportionately affected by Imatrix/datasets.

To give you an idea here is some test data from DarkSapling 7b V2 (32bit is "Ultra Quality"):

IMAT DATASET: fanfiction-com3 + default // 785 chunks // IQ3_XS [ F32: (-14983) || IMAT PLUS: OFF SCALE (-62706) ]

None 16 bit 14.1283 +/- 0.09702
None 32 bit 12.6300 +/- 0.08623
Imat all layers 6.3594 +/- 0.03722

IMAT DATASET: fanfiction-com3 + default // 785 chunks // IQ3_S [ F32: (-13235) || IMAT PLUS: OFF SCALE (-60723) ]

None 16 bit 13.6948 +/- 0.09466
None 32 bit 12.3713 +/- 0.08363
Imat all layers 6.2990 +/- 0.03658

IMAT DATASET: fanfiction-com3 + default // 785 chunks // Q4_K_S [ F32: (-161) || IMAT PLUS: Super (-438) ]

None 16 bit 6.2397 +/- 0.03650
None 32 bit 6.2236 +/- 0.03643
Imat all layers 6.1798 +/- 0.03623

IMAT DATASET: fanfiction-com3 + default // 785 chunks // Q4_K_M [ F32: (-285) || IMAT PLUS - Good (-139), X GREAT [ 4-5 ] ]

| Filter | PPL | Output | Sentences | Observation(s) |
|---|---|---|---|---|
| None 16 bit | 6.2098 +/- 0.03638 | 716 | 3::2::3 | Science explanations, very distant, cold. |
| None 32 bit | 6.1813 +/- 0.03614 | 358 | 1::1::1 | Short but smart assed. Good sentence variations. |
| Imat all layers | 6.1674 +/- 0.03615 | 946 | 4::4::3 | [!] Personification. Similes, complex personification. |

This is from some of my test data from the last few days, as I map out "Imatrix X" protocols.
The Imatrix X data is not listed here.
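
The bracketed figures in the test headers above appear to be the perplexity change between rows, expressed in units of 0.0001 PPL (16 bit -> 32 bit, then 32 bit -> imatrix). That reading is inferred from the numbers rather than stated in the thread; the sketch below reproduces the figures from the listed PPL values:

```python
def ppl_delta(before: float, after: float) -> int:
    """Perplexity change in units of 0.0001 (negative means improvement)."""
    return round((after - before) * 10_000)

# (16 bit, 32 bit, imatrix all layers) PPL values copied from the tests above.
results = {
    "IQ3_XS": (14.1283, 12.6300, 6.3594),
    "IQ3_S": (13.6948, 12.3713, 6.2990),
    "Q4_K_S": (6.2397, 6.2236, 6.1798),
    "Q4_K_M": (6.2098, 6.1813, 6.1674),
}

for quant, (f16, f32, imat) in results.items():
    print(f"{quant}: F32 {ppl_delta(f16, f32)}, IMAT {ppl_delta(f32, imat)}")
```

Note how small the Q4 deltas are next to the reported +/- error on each measurement, while the IQ3 deltas are far outside it.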
