Full SFT training caused the model to lose its foundational capabilities

#71
by sinlew - opened

After SFT training this model with Transformers 4.43.3, the MMLU score dropped from the original model's 67 points to 22 points. Why did this happen? With the same data, Llama-3-8B-Instruct only dropped to 46 points.

Can you share a link to the code or the dataset?

Please provide more context.

https://github.com/hiyouga/LLaMA-Factory/issues/5047
This is the training code.
The dataset consists of chat logs from different individuals, totaling over 10,000 entries. Each chat log contains approximately 30-80 rounds formatted in the SUAUA pattern. The conversations are primarily casual and do not involve much specialized knowledge.
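For reference, a single entry looks roughly like this (shown as a Python dict in LLaMA-Factory's sharegpt-style format; the field names follow that convention as I understand it, and the content is purely illustrative):

```python
# Rough illustration of one multi-turn chat-log entry in a sharegpt-style
# format (system turn followed by alternating user/assistant turns).
# The conversation content here is made up.
example_entry = {
    "conversations": [
        {"from": "system", "value": "You are a friendly chat partner."},
        {"from": "human", "value": "Hey, how was your weekend?"},
        {"from": "gpt", "value": "Pretty relaxed, mostly reading. Yours?"},
        {"from": "human", "value": "Went hiking, the weather was great."},
        {"from": "gpt", "value": "Nice! Which trail did you take?"},
        # ... continues for roughly 30-80 rounds per chat log
    ]
}
```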

I noticed that you have fine-tuned models before. Does your model have this issue?

Yes, I'm also facing the same issue with Llama 3 and Llama 3.1. I used the Unsloth approach, but I didn't do full SFT; I trained with very minimal data.

Note: I also raised the same kind of question; please check here:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/78
https://discuss.huggingface.co/t/my-adapter-model-dominating-the-entire-base-model/100577

Can we connect privately if possible? Let's discuss through Google Meet or Zoom.

https://discuss.huggingface.co/t/my-adapter-model-dominating-the-entire-base-model/100577

Have a look at the above URL. In my case, I used the standard format and it's working fine compared to the previous responses.

I replaced and optimized the training data to solve this problem. Simple dialogue fine-tuning on Llama 3.1 can significantly disrupt the foundational capabilities of the model. After increasing dialogue diversity, the model at least maintained some of its foundational abilities. However, the fine-tuned model still does not meet the expected capabilities.
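In case it helps others, here is a rough sketch of the kind of data mixing I mean, using the `datasets` library. The file names and the 70/30 ratio are placeholders, not the exact values I used:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder file names; point these at your own chat logs and at some
# general-purpose instruction data used to preserve foundational abilities.
chat = load_dataset("json", data_files="chat_logs.json", split="train")
general = load_dataset("json", data_files="general_instructions.json", split="train")

# Interleave so SFT sees roughly 70% chat data and 30% general data,
# which increases dialogue diversity and reduces catastrophic forgetting.
mixed = interleave_datasets([chat, general], probabilities=[0.7, 0.3], seed=42)
mixed.to_json("mixed_sft_data.json")
```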

@sinlew In my case I used Llama 3.1 and 3.2, and both are working fine; I used a 10k-example dataset for fine-tuning and everything works well. I want to know your LoRA settings and the number of epochs you are using.

@antony-pk I use 6 epochs. Have you run any evaluation benchmarks, such as MMLU or MMLU-Pro? Has there been a significant drop in scores?

For your information, I'm not using MMLU or MMLU-Pro; I used my own custom evaluation script.
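For anyone who wants to check for the same regression, here is a minimal sketch of an MMLU run with EleutherAI's lm-evaluation-harness. The model path, dtype, and batch size are placeholders, and the exact result keys can differ between lm-eval versions:

```python
# Minimal MMLU check with lm-evaluation-harness (pip install lm-eval).
# Model path, dtype, and batch size below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/finetuned-model,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)

# Aggregated MMLU score; key names may vary slightly across versions.
print(results["results"]["mmlu"])
```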