Training scripts?
Hi,
We used this model to train a classifier for Icelandic homographs, and the results were quite good. See https://github.com/grammatek/IceHoc.
I'd be interested in the training scripts for this LM, especially when it comes to dataset preparation and cleaning. Would you share those scripts?
Regards,
Daniel.
Hi Daniel,
Happy to hear that the model performed so well on homograph classification. To pre-train the model, I followed Stefan Schweter's instructions:
https://github.com/stefan-it/turkish-bert/blob/master/convbert/CHEATSHEET.md
https://github.com/stefan-it/turkish-bert/blob/master/electra/CHEATSHEET.md
I used the pre-training script from the ConvBERT repository. Since the pre-training corpus (i.e., the Icelandic Gigaword Corpus) doesn't contain any web-crawled or noisy documents, I didn't perform any filtering or cleaning beforehand.
Best regards,
Jón