Tokenizer Type
#5
by abhinand - opened
I notice that the vocabulary contains sub-word tokens.
...
{'word': '##م', 'token': 7030},
{'word': '##di', 'token': 7033},
{'word': '##kan', 'token': 7036},
{'word': '##ek', 'token': 7037},
{'word': '##ak', 'token': 7040},
{'word': '##ı', 'token': 7042},
{'word': '##lo', 'token': 7044},
{'word': '##ung', 'token': 7045},
...
But for PL-BERT, isn't it necessary to have 'word-level' tokens? Or am I misunderstanding something here?
Another reference thread that discusses this: https://github.com/yl4579/StyleTTS2/issues/286#issuecomment-2383835836
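For what it's worth, here is a quick check I tried (my own sketch, not PL-BERT code; the checkpoint name and example text are just placeholders): with a fast Hugging Face tokenizer, word_ids() reports which source word each sub-word piece came from, so word-level grouping seems recoverable even from a sub-word vocabulary.

```python
from transformers import AutoTokenizer

# any sub-word tokenizer works here; this checkpoint is only an example
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = tok("makan di lokasi", add_special_tokens=False)

# sub-word pieces (the exact split depends on the vocabulary)
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# word index of each piece, e.g. [0, 1, 2, 2] -> pieces grouped by source word
print(enc.word_ids())
```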
So does this mean I can use a tokenizer like Indic-BERT's to build my token_maps for an Indic-PL-BERT?
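To make the question concrete: the entries above look like a plain dict mapping each original tokenizer id to the token string plus a compact new id, so I imagine token_maps for another tokenizer could be built roughly like this (the model id, output filename, and the assumption that the map should cover the whole vocabulary are mine; the actual PL-BERT preprocessing may instead restrict the map to tokens seen in the training corpus):

```python
import pickle
from transformers import AutoTokenizer

# example Indic tokenizer; swap in whichever checkpoint you actually use
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

token_maps = {}
# walk the vocab in original-id order and assign compact, contiguous new ids,
# mirroring the {'word': ..., 'token': ...} entry format shown above
for new_id, (word, orig_id) in enumerate(
        sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])):
    token_maps[orig_id] = {'word': word, 'token': new_id}

with open('token_maps.pkl', 'wb') as f:
    pickle.dump(token_maps, f)
```

If that is roughly right, then my question boils down to whether anything in PL-BERT actually breaks when these mapped entries are sub-word pieces rather than whole words.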