Warn about config mismatch for pre-training
The model card does not warn the user that this config is not scaled properly for pre-training alongside the corresponding google/electra-small-discriminator.
I took another look at the original code and checkpoints: the ELECTRA framework offers checkpoints to download that contain both the generator and the discriminator, and these are the checkpoints we isolated here.
Here's the table from the original codebase:
Okay, I see! My main concern (and what got me stuck for a while) is that if you load `ElectraForMaskedLM(ElectraConfig.from_pretrained("google/electra-small-generator"))`,
you get an architecture that cannot be pre-trained together with `ElectraForPreTraining(ElectraConfig.from_pretrained("google/electra-small-discriminator"))`,
because doing so leads to training instability. Most of the generator's hyperparameters are supposed to be 1/4 of the discriminator's, but currently they are all equal.
I don't really understand why the generator checkpoint uses the wrong parameters in the first place, but it would be helpful to have a warning somewhere that points this out.
Yes, I definitely understand where you're coming from, and I think a warning is warranted; let's try to make it as helpful as possible. Where did you get the notion that the generator's hyperparameters are supposed to be 1/4 of the discriminator's? Is that in the paper? Thanks!
Yes, you can see that in the paper! They need to divide the hidden size, the FFN size, and the number of attention heads by some factor so that the model converges (this holds for every model size). The intuition is that if the generator is too strong, it will fool the discriminator too easily and make the learning process impossible.
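For concreteness, the scaling rule above can be sketched like this. It is only an illustration of the 1/4 multiplier: the `scale_generator_config` helper is hypothetical, and the discriminator values are assumed example sizes for the small model, not values read from a released config.

```python
# Sketch: derive a properly scaled generator config from a discriminator
# config by applying the 1/4 multiplier recommended in the ELECTRA paper.
# The discriminator values below are illustrative, not read from a checkpoint.

GENERATOR_MULTIPLIER = 1 / 4  # recommended generator/discriminator size ratio

discriminator_config = {
    "hidden_size": 256,         # assumed small-discriminator hidden size
    "intermediate_size": 1024,  # assumed FFN size
    "num_attention_heads": 4,
}

def scale_generator_config(disc_config, multiplier=GENERATOR_MULTIPLIER):
    """Scale down hidden size, FFN size, and attention heads for the generator."""
    scaled = dict(disc_config)
    for key in ("hidden_size", "intermediate_size", "num_attention_heads"):
        # Keep at least 1 so the resulting architecture stays valid.
        scaled[key] = max(1, int(disc_config[key] * multiplier))
    return scaled

generator_config = scale_generator_config(discriminator_config)
print(generator_config)
# → {'hidden_size': 64, 'intermediate_size': 256, 'num_attention_heads': 1}
```

The same dict could then be used to populate an `ElectraConfig` for the generator instead of loading the released generator config unchanged.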
I see, indeed! In that case, would you be willing to edit your warning to mention that:
- This is the official generator checkpoint as in the ELECTRA original codebase
- However, the paper recommends a generator-to-discriminator size multiplier of 1/4 for this model, so using it off the shelf will likely result in training instabilities
Would this work for you?
I tried to make it clearer based on your comments. Please let me know if I can improve the message further!