Loss = 0 and Gradient = NaN in ModernBERT Fine-Tuning for Regression
#63
by saran1999 - opened
I am facing an issue while fine-tuning ModernBERT for a regression task: the loss is 0 and the gradients are NaN from the very first steps. The same setup works fine with BERT. I pre-trained this ModernBERT model from scratch on my domain dataset.
- Flash attention is disabled.
- Tried both fp16=True and fp16=False; the problem occurs either way.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6e-06, 'epoch': 0.0}
Model Architecture:

import torch.nn as nn
from transformers.models.modernbert.modeling_modernbert import (
    ModernBertModel,
    ModernBertPreTrainedModel,
)

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[0][:, 0, :]  # CLS token embedding
        predictions = self.regressor(pooled_output)
        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            loss = loss_fct(predictions.squeeze(), labels)
        return loss, predictions
Training Arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    fp16=True,
    logging_dir="./logs",
)
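To localize where the first non-finite value appears, I can also enable PyTorch's anomaly detection and attach forward hooks. A minimal sketch; `model` is the ModernBertForRegression instance above, and `nan_hook` is just an illustrative name:

import torch

# Make backward() raise with a traceback pointing at the op that produced the NaN.
torch.autograd.set_detect_anomaly(True)

# Flag the first submodule whose output contains NaN/Inf during the forward pass.
def nan_hook(module, inputs, output):
    tensors = output if isinstance(output, tuple) else (output,)
    for t in tensors:
        if torch.is_tensor(t) and not torch.isfinite(t).all():
            raise RuntimeError(f"Non-finite output in {module.__class__.__name__}")

for submodule in model.modules():
    submodule.register_forward_hook(nan_hook)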
Would appreciate any suggestions on why this could be happening...
This should probably fix your issue. The built-in ModernBertForSequenceClassification runs the pooled output through a ModernBertPredictionHead (dense layer, activation, and norm) before the final linear layer, so a custom head should do the same:
import torch.nn as nn
from transformers.models.modernbert.modeling_modernbert import (
    ModernBertModel,
    ModernBertPredictionHead,
    ModernBertPreTrainedModel,
)

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.head = ModernBertPredictionHead(config)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state_cls = outputs[0][:, 0]  # CLS token embedding
        pooled_output = self.head(last_hidden_state_cls)
        predictions = self.regressor(pooled_output)
        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            loss = loss_fct(predictions.squeeze(), labels)
        return loss, predictions
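Alternatively, you can skip the custom class entirely: with num_labels=1 and problem_type="regression", the stock sequence-classification model (which already includes the prediction head) computes an MSE loss for you. A sketch, where "answerdotai/ModernBERT-base" stands in for your own pretrained checkpoint:

from transformers import AutoModelForSequenceClassification

# num_labels=1 plus problem_type="regression" makes the stock head a
# single-output regression head trained with MSELoss.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=1,
    problem_type="regression",
)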
saran1999 changed discussion status to closed
saran1999 changed discussion status to open