Pre-training tasks for GBERT: meaning of CLS token
#4 · opened by slvnwhrl
Hi,
first of all, thanks for the helpful work! I wanted to ask about the language modeling tasks used to train the GBERT models. If I understand your paper correctly, the models are trained only on the masked language modeling task (with whole word masking) and not on next sentence prediction?
I think the model checkpoint contains weights for the CLS pooling layer, which would only make sense if that layer had been trained, since it is only used for the next sentence prediction task? Hugging Face's BertForMaskedLM architecture (the one specified for this model) does not use the pooling layer (see here).
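For reference, here is a minimal sketch of how the stored weights can be inspected directly, assuming the checkpoint in question is `deepset/gbert-base` and ships a `pytorch_model.bin` file (adjust the repo id / filename for the large model or a safetensors checkpoint):

```python
import torch
from huggingface_hub import hf_hub_download

# Download the raw weight file and load its state dict on CPU.
weights_path = hf_hub_download("deepset/gbert-base", "pytorch_model.bin")
state_dict = torch.load(weights_path, map_location="cpu")

# List any pooler parameters saved in the checkpoint. BertForMaskedLM
# never loads these, since it builds BertModel with add_pooling_layer=False.
print([name for name in state_dict if "pooler" in name])
```

If this prints something like `bert.pooler.dense.weight` / `bert.pooler.dense.bias`, the pooling layer was saved with the checkpoint even though BertForMaskedLM ignores it at load time.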
Thanks for your help!