Base model: https://huggingface.co/indiejoseph/cantonese-llama-2-7b-oasst-v1
Finetuned on the Cantonese-Mandarin translation task following the ALMA recipe (https://github.com/fe1ixxu/ALMA).
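For reference, a minimal inference sketch under two assumptions: the prompt follows ALMA's translation template, and `model_id` would point at the finetuned checkpoint (only the base model is linked above, so the id here is a placeholder).

```python
# Minimal inference sketch. Assumptions: ALMA's translation prompt template,
# and the checkpoint id -- only the base model is public, so substitute the
# finetuned weights here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "indiejoseph/cantonese-llama-2-7b-oasst-v1"  # swap in the finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# ALMA-style zero-shot translation prompt.
prompt = "Translate this from Mandarin to Cantonese:\nMandarin: 他们在哪里？\nCantonese:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```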
Finetuning dataset: sourced from the raw dataset released at https://github.com/meganndare/cantonese-nlp
Since the base model had already been finetuned on monolingual Cantonese data, we performed only the parallel-sentence finetuning stage.
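A minimal sketch of how parallel pairs can be packed into ALMA-style training examples: the prompt template follows the ALMA paper, while the file names and the `yue`/`cmn` keys are hypothetical placeholders for the cantonese-nlp data.

```python
# Sketch: turn parallel sentence pairs into ALMA-style prompt/completion
# records. File names and JSON keys below are hypothetical.
import json

PROMPT = "Translate this from {src} to {tgt}:\n{src}: {src_text}\n{tgt}:"

def build_example(src, tgt, src_text, tgt_text):
    """Pack one parallel pair into a prompt/completion record."""
    return {
        "prompt": PROMPT.format(src=src, tgt=tgt, src_text=src_text),
        "completion": " " + tgt_text,
    }

with open("parallel_pairs.jsonl") as fin, open("train.jsonl", "w") as fout:
    for line in fin:
        pair = json.loads(line)  # e.g. {"yue": "...", "cmn": "..."}
        # Train both directions on the same parallel corpus.
        for ex in (
            build_example("Cantonese", "Mandarin", pair["yue"], pair["cmn"]),
            build_example("Mandarin", "Cantonese", pair["cmn"], pair["yue"]),
        ):
            fout.write(json.dumps(ex, ensure_ascii=False) + "\n")
```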
Results:
Mandarin -> Cantonese: 35.371 BLEU, 26.197 ChrF++
Cantonese -> Mandarin: 36.553 BLEU, 27.471 ChrF++
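The scores above could be reproduced along these lines with sacrebleu; the file names are hypothetical, and `tokenize="zh"` is an assumption (both sides are Chinese-script text).

```python
# Sketch of the evaluation: corpus-level BLEU and ChrF++ via sacrebleu.
# "hypotheses.txt" / "references.txt" are hypothetical, one sentence per line.
import sacrebleu

with open("hypotheses.txt") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("references.txt") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh")
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # word_order=2 -> ChrF++
print(f"BLEU: {bleu.score:.3f}  ChrF++: {chrf.score:.3f}")
```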
GitHub repo: https://github.com/cmgao/nlp_project
The ALMA code is included as a git submodule.