Hey @Tianming, the ALBERT implementation does implement cross-layer parameter sharing exactly as the original implementation did. The original implementation works in terms of layer groups: you can have 12 layers but a single layer group, which means one layer repeated 12 times, or you could have 12 layers with 3 layer groups (so 3 different layers, each repeated 4 times).
If you take a look at the configuration file (https://huggingface.co/albert-base-v2/blob/main/config.json#L23), you'll see that there's a single layer group for this model (`"num_hidden_groups": 1`).
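If you'd rather confirm this programmatically, a quick check along these lines should do it (a minimal sketch, assuming you have the `transformers` library installed):

```python
# Minimal sketch: inspect the albert-base-v2 configuration values
# referenced above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("albert-base-v2")
print(config.num_hidden_layers)  # 12
print(config.num_hidden_groups)  # 1
```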
Now if we take a look at the implementation, you'll see the following. The `layers_per_group` variable is defined as follows:

```python
layers_per_group = int(self.config.num_hidden_layers / self.config.num_hidden_groups)
```

It will therefore be equal to the number of hidden layers, so 12.
The layer group ID that will be used afterwards is defined here:

```python
group_idx = int(i / (self.config.num_hidden_layers / self.config.num_hidden_groups))
```

This will always be `0`: the Transformer will only iterate through the first layer group, which is a single layer repeated 12 times.
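To make the grouping arithmetic concrete, here's a standalone illustration of the two formulas above (plain Python mimicking the quoted lines, not the library code itself; `layer_to_group` is just a name I made up):

```python
def layer_to_group(num_hidden_layers, num_hidden_groups):
    # Same arithmetic as the two lines quoted from the implementation:
    # each layer index i is mapped to the group whose parameters it reuses.
    layers_per_group = int(num_hidden_layers / num_hidden_groups)
    return [int(i / layers_per_group) for i in range(num_hidden_layers)]

print(layer_to_group(12, 1))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(layer_to_group(12, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

You can also see the sharing directly on the model: with a single group, the encoder only instantiates one layer-group module, so all 12 layer applications reuse the same parameters. If I remember the current source correctly, `len(model.encoder.albert_layer_groups)` should print `1` for `albert-base-v2`.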
Hope that helps!
I use AutoModel and AutoConfig to load the model and Hugging Face's built-in Trainer for training. After training ends, each time I load the saved model and print its parameters, the parameter values are different, and if I load the saved model again for further training, the model becomes invalid.
(you can also post your issue URL here for future reference)
Thanks.