Adding new vocabulary into the model
I am curious how I can add more vocabulary to the model.
Currently, I use bart-base-chinese as the pre-trained model, but I have found that its vocabulary of 51,271 tokens is insufficient: unknown tokens appear from time to time. Say I have 50,000 more words to add to the model, can you share some ideas on how I can achieve this?
Thanks.
You can add new tokens to the vocabulary by following the Hugging Face docs, like this:
# assuming the tokenizer and model have been loaded with from_pretrained()
tokenizer.add_tokens(["JU","AZ"])
model.resize_token_embeddings(len(tokenizer))
Note that the embeddings of the added tokens are untrained (randomly initialized), so they need to be further pre-trained or fine-tuned on additional data.
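For completeness, here is a minimal, self-contained sketch of the same steps. The checkpoint name fnlp/bart-base-chinese and the file new_words.txt are illustrative assumptions, not part of the original answer:

# Sketch only: extend the tokenizer and the embedding matrix with new tokens.
# "fnlp/bart-base-chinese" and "new_words.txt" are assumed, illustrative names.
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

# Read the new vocabulary, one token per line.
with open("new_words.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

num_added = tokenizer.add_tokens(new_tokens)   # tokens already in the vocabulary are skipped
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Save the extended tokenizer and model for later pre-training / fine-tuning.
tokenizer.save_pretrained("./extended_tokenizer")
model.save_pretrained("./extended_model")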
May I know how to further pre-train / fine-tune?
Specifically, I saw the following in the pre-training README:
- dataset: Place the .bin and .idx files preprocessed from raw text. I figured this out by reading the Megatron README > Preprocess Data section.
- vocab: Place the vocab files and model config file. I saved the tokenizer with the .save_pretrained() function, which generated the following files: added_tokens.json, special_tokens_map.json, tokenizer_config.json and vocab.txt. Are these files okay?
- roberta_zh: Place the checkpoint of Chinese RoBERTa, since CPT initializes its encoder from that checkpoint. How do I do that? Use .from_pretrained(), then .save_pretrained(), as in the sketch below?
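To make my question concrete, this is roughly what I have in mind; the checkpoint name hfl/chinese-roberta-wwm-ext and the output path are just placeholders for illustration:

# Illustration only: convert a Hugging Face Chinese RoBERTa checkpoint into local files
# for the roberta_zh directory. The checkpoint name and path are placeholders.
from transformers import BertTokenizer, BertModel

roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta_tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Save the weights, config and vocab files into the directory the pre-training script expects.
roberta.save_pretrained("./roberta_zh")
roberta_tokenizer.save_pretrained("./roberta_zh")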
Thanks in advance.
We provide code for fine-tuning in our GitHub repo: https://github.com/fastnlp/CPT/finetune
You can use it for further pre-training or fine-tuning.
Thanks for the reply. If I add more vocabulary to the model, I should first pre-train it and then fine-tune it, is that correct? Thanks.
When I run ./run_pretrain_bart.sh, it fails at an early stage with the following error:
[rank0]: IndexError: Caught IndexError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/blendable_dataset.py", line 83, in __getitem__
[rank0]: return self.datasets[dataset_idx][sample_idx]
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 106, in __getitem__
[rank0]: return self.build_training_sample(sample, self.max_seq_length, np_rng)
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 148, in build_training_sample
[rank0]: source = self.add_whole_word_mask(source, mask_ratio, replace_length)
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 360, in add_whole_word_mask
[rank0]: source[indices[mask_random]] = torch.randint(
[rank0]: IndexError: The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [1] at index 0
The dataset is generated using Megatron's Preprocess Data method.
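For context, the failing line indexes a tensor with a boolean mask, which only works when the mask length matches the size of the indexed dimension. Here is a tiny standalone reproduction of the same IndexError; the values are made up and not taken from the CPT code:

import torch

# One candidate position, but a boolean mask of length two:
indices = torch.tensor([5])
mask_random = torch.tensor([True, False])

# Raises the same error as in the traceback above:
# "The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [1] at index 0"
selected = indices[mask_random]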