Are there two identical embedding tensors, even though embeddings are shared?
The SmolLM models have tied embeddings (config.tie_word_embeddings = True), i.e. input and output embeddings are shared. However, the model contains two identical embedding tensors, lm_head.weight and model.embed_tokens.weight. This seems to defeat the purpose of having shared (aka tied) embeddings. Any ideas?
Clicking on the arrow on the right-hand side (see the first screenshot below) shows a summary of the parameters without the lm_head tensor; see the second screenshot below (model.norm.weight is the last tensor shown).
Here is code showing that the model contains two identical tensors for input and output embeddings:
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained('HuggingFaceTB/SmolLM-135M')
print(torch.equal(model.lm_head.weight, model.model.embed_tokens.weight))
The above code returns True, i.e. the two weight tensors hold identical values.
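Note that torch.equal only compares values, so by itself it cannot tell whether lm_head.weight and model.embed_tokens.weight are two separate copies or a single shared tensor. As a further check, here is a small sketch that can be run on the same model object as above; the data_ptr comparison only tests whether both names refer to the same underlying storage:
# Reusing the `model` loaded above.
lm_head_w = model.lm_head.weight
embed_w = model.model.embed_tokens.weight

# True only if both attributes point at the same underlying storage,
# i.e. the embeddings are genuinely tied rather than duplicated.
print(lm_head_w.data_ptr() == embed_w.data_ptr())

# Deduplicated parameter count: a tensor shared by two attributes is
# counted only once here.
unique = {p.data_ptr(): p for p in model.parameters()}
print(sum(p.numel() for p in unique.values()))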
The issue might be in modeling_llama.py, which doesn't seem to fully support shared embeddings. Specifically, line 1210 always projects through self.lm_head:
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
Contrast this, for example, with OpenELM's code, which seems to fully support both tied and untied embeddings (line 878):
if self.lm_head is None:
    # shared
    logits = F.linear(hidden_states, weight=self.transformer.token_embeddings.weight)
else:
    logits = self.lm_head(hidden_states)
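To make the contrast concrete, here is a minimal standalone PyTorch sketch (my own illustration, not code from transformers or OpenELM) of the two patterns: projecting directly with the embedding matrix via F.linear, versus keeping a separate lm_head module whose weight attribute is pointed at the embedding parameter, so that only one tensor exists:
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size = 1000, 64
embed = nn.Embedding(vocab_size, hidden_size)
hidden_states = torch.randn(2, 5, hidden_size)

# Pattern 1 (OpenELM-style): no lm_head module at all; reuse the
# embedding weight directly for the output projection.
logits_shared = F.linear(hidden_states, weight=embed.weight)

# Pattern 2 (weight tying): a separate lm_head module whose .weight is
# replaced by the embedding parameter, so both names refer to one tensor.
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
lm_head.weight = embed.weight
logits_tied = lm_head(hidden_states)

# Both approaches produce the same logits, and in pattern 2 the two
# attribute names still point at a single underlying parameter.
print(torch.allclose(logits_shared, logits_tied))
print(lm_head.weight.data_ptr() == embed.weight.data_ptr())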