Adding new vocabulary into the model
I am curious how I can add more vocabulary to the model.
Currently, I use bart-base-chinese as the pre-trained model, but I have found that its vocabulary of 51,271 tokens is insufficient: unknown tokens appear from time to time. Say I have 50,000 more words to add to the model, can you share some ideas on how I can achieve this?
Thanks.
You can add new tokens to the vocabulary by following the Hugging Face docs, like this:
# assuming the tokenizer and model have been loaded with from_pretrained()
tokenizer.add_tokens(["JU","AZ"])
model.resize_token_embeddings(len(tokenizer))
Note that the embeddings of the added tokens are untrained (randomly initialized), so they need to be further pre-trained or fine-tuned on additional data.
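For completeness, here is a minimal, self-contained sketch of the same steps. The checkpoint name fnlp/bart-base-chinese and the file new_words.txt are illustrative assumptions, not part of the original answer:

# Sketch only: extend the tokenizer and the embedding matrix with new tokens.
# "fnlp/bart-base-chinese" and "new_words.txt" are assumed, illustrative names.
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

# Read the new vocabulary, one token per line.
with open("new_words.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

num_added = tokenizer.add_tokens(new_tokens)   # tokens already in the vocabulary are skipped
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Save the extended tokenizer and model for later pre-training / fine-tuning.
tokenizer.save_pretrained("./extended_tokenizer")
model.save_pretrained("./extended_model")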
May I know how to further pre-train / fine-tune?
Specifically, I saw the following in the pre-training README:
- dataset: Place the .bin and .idx files preprocessed from raw text. I figured this out by reading the Megatron README > Preprocess Data section.
- vocab: Place the vocab files and model config file. I saved the tokenizer with the .save_pretrained() function, which generated the following files: added_tokens.json, special_tokens_map.json, tokenizer_config.json and vocab.txt. Are these files okay?
- roberta_zh: Place the checkpoint of Chinese RoBERTa, since CPT initializes its encoder from that checkpoint. How do I do that? Use .from_pretrained(), then .save_pretrained(), as in the sketch below?
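To make my question concrete, this is roughly what I have in mind; the checkpoint name hfl/chinese-roberta-wwm-ext and the output path are just placeholders for illustration:

# Illustration only: convert a Hugging Face Chinese RoBERTa checkpoint into local files
# for the roberta_zh directory. The checkpoint name and path are placeholders.
from transformers import BertTokenizer, BertModel

roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta_tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Save the weights, config and vocab files into the directory the pre-training script expects.
roberta.save_pretrained("./roberta_zh")
roberta_tokenizer.save_pretrained("./roberta_zh")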
Thanks in advance.
We provide code for fine-tuning in our GitHub repo: https://github.com/fastnlp/CPT/finetune
You can use it for further pre-training or fine-tuning.
Thanks for the reply. If I add more vocabulary to the model, I should first pre-train it and then fine-tune it, is that correct? Thanks.
When I run ./run_pretrain_bart.sh, it fails at an early stage with the following error:
[rank0]: IndexError: Caught IndexError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/jupyter-raptor/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank0]: data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/blendable_dataset.py", line 83, in __getitem__
[rank0]: return self.datasets[dataset_idx][sample_idx]
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 106, in __getitem__
[rank0]: return self.build_training_sample(sample, self.max_seq_length, np_rng)
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 148, in build_training_sample
[rank0]: source = self.add_whole_word_mask(source, mask_ratio, replace_length)
[rank0]: File "/home/jupyter-raptor/pretrain_tokenizer/megatron/data/bart_dataset.py", line 360, in add_whole_word_mask
[rank0]: source[indices[mask_random]] = torch.randint(
[rank0]: IndexError: The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [1] at index 0
The dataset is generated using Megatron's Preprocess Data method.
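For context, the failing line indexes a tensor with a boolean mask, which only works when the mask length matches the size of the indexed dimension. Here is a tiny standalone reproduction of the same IndexError; the values are made up and not taken from the CPT code:

import torch

# One candidate position, but a boolean mask of length two:
indices = torch.tensor([5])
mask_random = torch.tensor([True, False])

# Raises the same error as in the traceback above:
# "The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [1] at index 0"
selected = indices[mask_random]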