Which dataset did you use to further fine-tune the abliterated models?
Can you reveal more training details?
As mentioned in this article, abliteration degrades the model's quality, so we need to further fine-tune it to heal the damage that abliteration causes.
As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the ablated version across all benchmarks. The ablation process successfully uncensored it but also degraded the model's quality.
To address this issue, one approach is to further train our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning: an additional SFT pass would likely break the model's performance.
Alternatively, preference alignment is quite light and shouldn't lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset.
Did you do further fine-tuning or not?
For now, no further fine-tuning has been done.
The article linked below mentions fine-tuning; it will be tried later.
Uncensor any LLM with abliteration