The PyTorch source code of ALBERT seems not to implement "cross-layer parameter sharing". Could someone show me how the code implements it?

#1
by Tianming - opened
ALBERT community org

Hey @Tianming , the ALBERT implementation does implement cross-layer parameter sharing exactly as the initial implementation did. The initial implementation works in terms of layer groups: you can have 12 layers but 1 layer group, which will mean one layer repeated 12 times, or you could have 12 layers with 3 layer groups (so 3 different layers, repeated 4 times each).

If you take a look at the configuration file (https://huggingface.co/albert-base-v2/blob/main/config.json#L23), you'll see that there's a single layer group for this model ("num_hidden_groups": 1).
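If it helps, here is a minimal sketch of how you could check this yourself in Python. It assumes a recent transformers install; the encoder.albert_layer_groups attribute name reflects the current implementation and may differ in older versions:

from transformers import AlbertConfig, AlbertModel

config = AlbertConfig.from_pretrained("albert-base-v2")
print(config.num_hidden_layers)   # 12
print(config.num_hidden_groups)   # 1

# With a single hidden group, only one set of layer parameters is instantiated
# and reused for every one of the 12 layer passes.
model = AlbertModel(config)
print(len(model.encoder.albert_layer_groups))  # 1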

Now if we take a look at the implementation, you'll see the following:

The layers_per_group variable is defined as follows:

layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)

It will therefore be equal to the number of hidden layers, i.e. 12.

The layer group ID that will be used afterwards is defined here:

group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))

This will always be 0: the Transformer will only iterate through the first layer group, which is a single layer repeated 12 times.
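To make the arithmetic concrete, here is a tiny standalone sketch of the indexing logic above (not the actual forward pass, just the two lines quoted earlier with albert-base-v2's values plugged in):

num_hidden_layers = 12
num_hidden_groups = 1

layers_per_group = int(num_hidden_layers / num_hidden_groups)  # 12

for i in range(num_hidden_layers):
    group_idx = int(i / (num_hidden_layers / num_hidden_groups))
    print(i, group_idx)  # group_idx is 0 for every i: the single layer group is reused 12 times

# With num_hidden_groups = 3 instead, group_idx would be 0,0,0,0,1,1,1,1,2,2,2,2,
# i.e. 3 different layer groups, each repeated 4 times.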

Hope that helps!

I use AutoModel and AutoConfig to load the model and train it with Hugging Face's built-in Trainer. After training, each time I load the saved model and print its parameters, the parameter values are different, and if I load the saved model again for further training, the model becomes invalid.

ALBERT community org

Hey @Gong! Do you mind filling in the bug template here with some reproducible code? This will help us get to the root of the problem with you. See you there!

ALBERT community org

(you can also post your issue URL here for future reference)

Thanks.
