training_full
This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-full dataset.
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
The run_clm.py script from the transformers library was used. Training was distributed across two NVIDIA Quadro RTX 6000 GPUs:
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
--nproc_per_node=2 run_clm.py --output_dir="./training_full" \
--model_type="gpt2" \
--config_name="./training" \
--tokenizer_name="./training" \
--dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-full" \
--do_train \
--per_device_train_batch_size 8 \
--block_size="1024" \
--learning_rate="5e-3" --warmup_steps="1000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="1" \
--logging_steps="500" \
--save_steps="5000" --preprocessing_num_workers="16" \
--gradient_accumulation_steps="4" --report_to="tensorboard" \
--logging_dir="./log_full" > command_full_log.log 2>&1 &
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.005
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 1.0
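The relationship between the batch-size entries above, and the shape of the linear schedule with warmup, can be checked with a small sketch. It assumes every training sequence is filled to the 1024-token block size given in the training command, and `total_steps` is a hypothetical placeholder, since the actual number of optimizer steps is not reported here. The function `linear_schedule_lr` is written for illustration and is not the library's scheduler.

```python
# Values taken from the hyperparameter list and the training command above.
per_device_train_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4
block_size = 1024  # from --block_size; assumes sequences are packed to full length

# Sequences contributing to one optimizer update across both GPUs.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
tokens_per_optimizer_step = total_train_batch_size * block_size


def linear_schedule_lr(step, total_steps, base_lr=5e-3, warmup_steps=1000):
    """Illustrative linear warmup followed by linear decay to zero.

    total_steps is a hypothetical argument: the real value depends on the
    dataset size, which this card does not state.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(total_train_batch_size)     # 64
print(tokens_per_optimizer_step)  # 65536
```

Under these assumptions the learning rate rises from 0 to 5e-3 over the first 1000 optimizer steps and then decays linearly to 0 at the final step.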
Training results
Evaluation results
Perplexity on 2000 randomly sampled examples from each target language's Wikipedia dataset, computed with the code provided in the perplexity docs, using a stride of 512 tokens:
Target language | PPL |
---|---|
en | 37.513710021972656 |
de | 24.629812240600586 |
es | 21.987037658691406 |
fr | 26.124969482421875 |
it | 26.723554611206055 |
pt | 21.162311553955078 |
nl | 32.36076736450195 |
The following script was used for evaluation:
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random
# Set the seeds for reproducibility (the sampling below draws from NumPy's RNG)
random.seed(42)
np.random.seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-full"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
target_language_dataset = "20231101.de" # change here for other languages
dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")
max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)
nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100
    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss
    nlls.append(neg_log_likelihood)
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break
ppl = torch.exp(torch.stack(nlls).mean())
print("Perplexity: ", ppl.item())
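To make the windowing in the evaluation script above concrete, the sketch below reproduces only its index arithmetic, without loading a model. The function name `sliding_windows` and the example length of 2500 tokens are hypothetical, and `max_length=1024` assumes the model's `n_positions` matches the 1024-token block size used in training.

```python
def sliding_windows(seq_len, max_length=1024, stride=512):
    """Yield (begin_loc, end_loc, trg_len) tuples as in the evaluation loop.

    trg_len is the number of trailing tokens in each window that are scored;
    the tokens before them serve only as context (labels set to -100).
    """
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        yield begin_loc, end_loc, trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break


# Hypothetical 2500-token text, chosen only to show a short final window.
windows = list(sliding_windows(seq_len=2500))
for window in windows:
    print(window)
# (0, 1024, 1024)
# (512, 1536, 512)
# (1024, 2048, 512)
# (1536, 2500, 452)
```

Every token is scored exactly once: the first window scores all 1024 of its tokens, and each later window scores only its new tokens while reusing the preceding ones as context.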
Framework versions
- Transformers 4.37.0.dev0
- Pytorch 1.13.0
- Datasets 2.16.0
- Tokenizers 0.15.0