Warn about config mismatch for pre-training
The model card does not warn the user that this config is not scaled properly for pre-training alongside the corresponding google/electra-small-discriminator.
I took another look at the original code and checkpoints: the ELECTRA framework offers checkpoints to download that contain both the generator and the discriminator, and these are the checkpoints we isolated here.
Here's the table from the original codebase:
Okay, I see! My main concern (and what got me stuck for a while) is that if you load `ElectraForMaskedLM(ElectraConfig.from_pretrained("google/electra-small-generator"))`,
you get an architecture that cannot be pre-trained together with `ElectraForPreTraining(ElectraConfig.from_pretrained("google/electra-small-discriminator"))`,
because doing so leads to training instability. Most of the generator's hyperparameters are supposed to be 1/4 of the discriminator's, but currently they are all equal.
I don't really understand why the generator checkpoint uses the wrong parameters in the first place, but it would be helpful to have a warning somewhere that points this out.
Yes, I definitely understand where you're coming from, and I think a warning is warranted; let's try to make it as helpful as possible. Where did you get the notion that the generator's hyperparameters are supposed to be 1/4 of the discriminator's? Is that in the paper? Thanks!
Yes, you can see that in the paper! They need to divide the hidden size, the FFN size, and the number of attention heads by some factor so that the model converges (this holds for every model size). The intuition is that if the generator is too strong, it will fool the discriminator too easily and make the learning process impossible.
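For concreteness, the scaling rule above can be sketched like this. It is only an illustration of the 1/4 multiplier: the `scale_generator_config` helper is hypothetical, and the discriminator values are assumed example sizes for the small model, not values read from a released config.

```python
# Sketch: derive a properly scaled generator config from a discriminator
# config by applying the 1/4 multiplier recommended in the ELECTRA paper.
# The discriminator values below are illustrative, not read from a checkpoint.

GENERATOR_MULTIPLIER = 1 / 4  # recommended generator/discriminator size ratio

discriminator_config = {
    "hidden_size": 256,         # assumed small-discriminator hidden size
    "intermediate_size": 1024,  # assumed FFN size
    "num_attention_heads": 4,
}

def scale_generator_config(disc_config, multiplier=GENERATOR_MULTIPLIER):
    """Scale down hidden size, FFN size, and attention heads for the generator."""
    scaled = dict(disc_config)
    for key in ("hidden_size", "intermediate_size", "num_attention_heads"):
        # Keep at least 1 so the resulting architecture stays valid.
        scaled[key] = max(1, int(disc_config[key] * multiplier))
    return scaled

generator_config = scale_generator_config(discriminator_config)
print(generator_config)
# → {'hidden_size': 64, 'intermediate_size': 256, 'num_attention_heads': 1}
```

The same dict could then be used to populate an `ElectraConfig` for the generator instead of loading the released generator config unchanged.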
I see, indeed! In that case, would you be willing to edit your warning to mention that:
- This is the official generator checkpoint as in the ELECTRA original codebase
- However, the paper recommends a generator-to-discriminator size multiplier of 1/4 for this model, so using it off the shelf will likely result in training instabilities
Would this work for you?
I tried to make it clearer based on your comments. Please let me know if I can improve the message further!