Hey @Tianming, the ALBERT implementation does implement cross-layer parameter sharing exactly as the original implementation did. The original implementation works in terms of layer groups: you can have 12 layers but a single layer group, which means one layer repeated 12 times, or you could have 12 layers with 3 layer groups (so 3 different layers, each repeated 4 times).
If you take a look at the configuration file (https://huggingface.co/albert-base-v2/blob/main/config.json#L23), you'll see that there's a single layer group for this model (`"num_hidden_groups": 1`).
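If you'd rather confirm this programmatically, a quick check along these lines should do it (a minimal sketch, assuming you have the `transformers` library installed):

```python
# Minimal sketch: inspect the albert-base-v2 configuration values
# referenced above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("albert-base-v2")
print(config.num_hidden_layers)  # 12
print(config.num_hidden_groups)  # 1
```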
Now if we take a look at the implementation, you'll see the following. The `layers_per_group` variable is defined as follows:

```python
layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)
```

It will therefore be equal to the number of hidden layers, so 12.
The layer group ID that will be used afterwards is defined here:

```python
group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))
```

This will always be `0`: the Transformer will only iterate through the first layer group, which is a single layer repeated 12 times.
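To make the grouping arithmetic concrete, here's a standalone illustration of the two formulas above (plain Python mimicking the quoted lines, not the library code itself; `layer_to_group` is just a name I made up):

```python
def layer_to_group(num_hidden_layers, num_hidden_groups):
    # Same arithmetic as the two lines quoted from the implementation:
    # each layer index i is mapped to the group whose parameters it reuses.
    layers_per_group = int(num_hidden_layers / num_hidden_groups)
    return [int(i / layers_per_group) for i in range(num_hidden_layers)]

print(layer_to_group(12, 1))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(layer_to_group(12, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

You can also see the sharing directly on the model: with a single group, the encoder only instantiates one layer-group module, so all 12 layer applications reuse the same parameters. If I remember the current source correctly, `len(model.encoder.albert_layer_groups)` should print `1` for `albert-base-v2`.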
Hope that helps!
I use AutoModel and AutoConfig to load the model and Hugging Face's built-in Trainer for training. After training ends, each time I load the saved model and print its parameters, the parameter values are different, and if I load the saved model again for further training, the model becomes invalid.
(you can also post your issue URL here for future reference)
Thanks.