answerdotai/ModernBERT-base · Loss = 0 and Gradient = NaN in ModernBERT Fine-Tuning for Regression

saran1999

2 days ago

I am running into a problem while fine-tuning ModernBERT for a regression task: from the first logged steps the loss is 0 and the gradient norm is NaN. The same setup works fine with BERT. I pre-trained this ModernBERT model from scratch on my own domain dataset.

Flash attention is disabled.
I tried both fp16=True and fp16=False; the problem occurs either way.

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6e-06, 'epoch': 0.0}

Model Architecture:

import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[0][:, 0, :]  # hidden state of the first ([CLS]) token
        predictions = self.regressor(pooled_output)  # shape: (batch, 1)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) drops only the last dim, so a batch of one stays 1-D
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions

Training Arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    fp16=True,
    logging_dir="./logs",
)

I would appreciate any suggestions on why this could be happening.
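
One thing that may be worth ruling out with logs like the ones above: as far as I know, the Trainer filters NaN and Inf step losses out of the logged value by default (the `logging_nan_inf_filter` training argument), so a logged loss of 0.0 next to a NaN `grad_norm` could mean the true loss is NaN at every step. The sketch below is a simplified, pure-Python stand-in for that kind of filtering, not the Trainer's actual code; `logged_loss` is a hypothetical helper written only for illustration.

```python
import math

def logged_loss(step_losses, nan_inf_filter=True):
    """Average loss reported for one logging window (simplified illustration)."""
    total = 0.0
    for i, loss in enumerate(step_losses):
        if nan_inf_filter and (math.isnan(loss) or math.isinf(loss)):
            # Replace the bad step with the running average of the window so far.
            total += total / (1 + i)
        else:
            total += loss
    return total / len(step_losses)

all_nan = [float("nan")] * 4
print(logged_loss(all_nan))                        # 0.0
print(logged_loss(all_nan, nan_inf_filter=False))  # nan
```

If that is what is happening here, passing `logging_nan_inf_filter=False` to `TrainingArguments` should make the logs show the raw loss, which would confirm that the forward pass itself produces NaN rather than a genuine zero loss.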

Tarok6

2 days ago

This should probably fix your issue: pass the [CLS] hidden state through ModernBertPredictionHead before the regressor.

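import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel
from transformers.models.modernbert.modeling_modernbert import ModernBertPredictionHead

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.head = ModernBertPredictionHead(config)  # dense + activation + norm
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state_cls = outputs[0][:, 0]  # hidden state of the first ([CLS]) token
        pooled_output = self.head(last_hidden_state_cls)
        predictions = self.regressor(pooled_output)  # shape: (batch, 1)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) drops only the last dim, so a batch of one stays 1-D
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions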

saran1999

2 days ago

Hi @Tarok6, thanks for your suggestion. I tried it out, but I still get the same result.

I also tried upgrading torch from 2.5.1 to 2.6.0, and the result is still the same.
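
Since neither the extra head nor the torch upgrade changed anything, a possible next step is to check whether the NaNs are already present in the loaded weights or first appear somewhere inside the forward pass. Below is a minimal sketch using only PyTorch; `non_finite_parameters` and `first_non_finite_module` are hypothetical helper names, and the toy `nn.Sequential` model only stands in for the real one.

```python
import torch
from torch import nn

def non_finite_parameters(model: nn.Module) -> list[str]:
    """Names of parameters that contain NaN or Inf values."""
    return [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]

def first_non_finite_module(model: nn.Module, *args, **kwargs):
    """Run one forward pass; return the name of the first submodule whose
    output contains NaN/Inf, or None if every checked output is finite."""
    hits = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Some modules return tuples; inspect the first element in that case.
            out = output[0] if isinstance(output, (tuple, list)) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                hits.append(name)
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n]
    try:
        with torch.no_grad():
            model(*args, **kwargs)
    finally:
        for h in handles:
            h.remove()  # always detach the hooks again
    return hits[0] if hits else None

# Toy stand-in model with one deliberately corrupted layer.
toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
with torch.no_grad():
    toy[2].weight.fill_(float("nan"))

print(non_finite_parameters(toy))                       # ['2.weight']
print(first_non_finite_module(toy, torch.randn(2, 4)))  # 2
```

On the real model, the same helpers could be called right after `from_pretrained` with one tokenized batch passed as keyword arguments. An empty parameter list together with a named module would point at the forward computation (for example the attention mask or the precision setting) rather than at the checkpoint.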