Pythia-2.8b, fine-tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on the helpful subset of the Anthropic hh-rlhf dataset for 1 epoch.
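
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective in plain Python. It is not the training code used for this model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration; the actual hyperparameters are not stated here. Each argument is the summed log-probability of a full response under either the policy being trained or the frozen reference model (typically the SFT checkpoint).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    beta=0.1 is an assumed illustrative value, not a setting taken from this model.
    """
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one (chosen, rejected) preference pair.
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy identical to reference
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)    # policy now prefers the chosen response
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.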