# NeoBERT Model
This is a NeoBERT model trained with `pszemraj/NeoBERT` and exported to the `transformers` format.
## Model Details
- Architecture: NeoBERT
- Hidden Size: 768
- Layers: 12
- Attention Heads: 12
- Vocab Size: 31999
- Max Length: 4096
- Dtype: float32
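
As a quick sanity check, the exported configuration can be inspected directly; a minimal sketch (the field names are assumed to match the training config shown further below and may differ slightly in the exported `config.json`):

```python
from transformers import AutoConfig

repo_id = "BEE-spoke-data/neobert-100k-test"
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)

# Print a few of the architecture fields listed above (names assumed, hence the getattr fallback)
for field in ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size"):
    print(field, getattr(config, field, "n/a"))
```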
## Usage
### For Masked Language Modeling (Fill-Mask)
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

repo_id = "BEE-spoke-data/neobert-100k-test"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

# Example: fill in masked tokens
text = "NeoBERT is the most [MASK] model of its kind!"

# Tokenize (handling the Metaspace tokenizer's space tokens)
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"][0].tolist()

# Remove extra space tokens before [MASK] if present (Metaspace tokenizer quirk)
cleaned_ids = []
for i, token_id in enumerate(input_ids):
    if token_id == 454 and i < len(input_ids) - 1 and input_ids[i + 1] == tokenizer.mask_token_id:
        continue
    cleaned_ids.append(token_id)

if len(cleaned_ids) != len(input_ids):
    inputs["input_ids"] = torch.tensor([cleaned_ids])
    inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
predictions = outputs.logits[0, mask_pos].topk(5)

# Display the top predictions
for idx, score in zip(predictions.indices, predictions.values):
    token = tokenizer.decode([idx])
    print(f"{token}: {score:.2f}")
```
### For Embeddings / Feature Extraction
```python
from transformers import AutoModel, AutoTokenizer
import torch

repo_id = "BEE-spoke-data/neobert-100k-test"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# Example: generate embeddings
text = "NeoBERT is an efficient transformer model!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the CLS token embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")
```
## Training Configuration
<details>
<summary>Full Config (click to expand)</summary>

Full training config:

```yaml
model:
  hidden_size: 768
  num_hidden_layers: 12
  num_attention_heads: 12
  intermediate_size: 3072
  max_position_embeddings: 4096
  vocab_size: 31999
  rope: true
  rms_norm: true
  hidden_act: swiglu
  dropout_prob: 0.05
  norm_eps: 1.0e-05
  embedding_init_range: 0.02
  decoder_init_range: 0.02
  classifier_init_range: 0.02
  flash_attention: true
  ngpt: false
  base_scale: 0.03227486121839514
  pad_token_id: 0
dataset:
  name: EleutherAI/SmolLM2-1.7B-stage-4-100B
  path: ''
  num_workers: 4
  streaming: true
  cache_dir: null
  max_seq_length: 1024
  validation_split: null
  train_split: train
  eval_split: train[:1%]
  num_proc: 8
  shuffle_buffer_size: 10000
  pre_tokenize: false
  pre_tokenize_output: null
  load_all_from_disk: false
  force_redownload: false
  pretraining_prob: 0.3
  min_length: 512
tokenizer:
  name: BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
  path: null
  max_length: 1024
  padding: max_length
  truncation: true
  vocab_size: 31999
optimizer:
  name: adamw
  lr: 0.0001
  weight_decay: 0.01
  betas:
    - 0.9
    - 0.98
  eps: 1.0e-08
scheduler:
  name: cosine
  warmup_steps: 5000
  total_steps: null
  num_cycles: 0.5
  decay_steps: 50000
  warmup_percent: null
  decay_percent: null
trainer:
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 4
  max_steps: 100000
  save_steps: 10000
  eval_steps: 5000
  logging_steps: 25
  output_dir: ./outputs/neobert_100m_100k
  overwrite_output_dir: true
  bf16: true
  gradient_checkpointing: false
  gradient_clipping: null
  mixed_precision: 'no'
  seed: 42
  resume_from_checkpoint: false
  disable_tqdm: false
  dataloader_num_workers: 0
  use_cpu: false
  report_to:
    - wandb
  tf32: true
  max_ckpt: 3
  train_batch_size: 16
  eval_batch_size: 32
datacollator:
  mlm_probability: 0.2
  pad_to_multiple_of: 8
wandb:
  project: neobert-pretraining
  entity: null
  name: neobert-100m-100k
  tags: []
  mode: online
  log_interval: 100
  resume: never
  dir: logs/wandb
task: pretraining
accelerate_config_file: null
mixed_precision: bf16
mteb_task_type: all
mteb_batch_size: 32
mteb_pooling: mean
mteb_overwrite_results: false
pretrained_checkpoint: latest
use_deepspeed: true
seed: 69
debug: false
```

</details>
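
For reference, the effective pretraining batch size implied by this config is `per_device_train_batch_size × gradient_accumulation_steps = 16 × 4 = 64` sequences per optimizer step per device, with sequences of up to 1024 tokens (`max_seq_length`) and 20% of tokens masked (`mlm_probability: 0.2`).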