Tokenizer Type
#5
by abhinand - opened
I notice that the vocabulary contains sub-word tokens.
...
{'word': '##م', 'token': 7030},
{'word': '##di', 'token': 7033},
{'word': '##kan', 'token': 7036},
{'word': '##ek', 'token': 7037},
{'word': '##ak', 'token': 7040},
{'word': '##ı', 'token': 7042},
{'word': '##lo', 'token': 7044},
{'word': '##ung', 'token': 7045},
...
But for PL-BERT, isn't it necessary to have 'word-level' tokens? Or am I misunderstanding something here?
Another reference thread that discusses this: https://github.com/yl4579/StyleTTS2/issues/286#issuecomment-2383835836
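For what it's worth, here is a quick check I tried (my own sketch, not PL-BERT code; the checkpoint name and example text are just placeholders): with a fast Hugging Face tokenizer, word_ids() reports which source word each sub-word piece came from, so word-level grouping seems recoverable even from a sub-word vocabulary.

```python
from transformers import AutoTokenizer

# any sub-word tokenizer works here; this checkpoint is only an example
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = tok("makan di lokasi", add_special_tokens=False)

# sub-word pieces (the exact split depends on the vocabulary)
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# word index of each piece, e.g. [0, 1, 2, 2] -> pieces grouped by source word
print(enc.word_ids())
```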
So does this mean I can use a tokenizer like Indic-BERT's to build my token_maps for an Indic-PL-BERT?
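To make the question concrete: the entries above look like a plain dict mapping each original tokenizer id to the token string plus a compact new id, so I imagine token_maps for another tokenizer could be built roughly like this (the model id, output filename, and the assumption that the map should cover the whole vocabulary are mine; the actual PL-BERT preprocessing may instead restrict the map to tokens seen in the training corpus):

```python
import pickle
from transformers import AutoTokenizer

# example Indic tokenizer; swap in whichever checkpoint you actually use
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

token_maps = {}
# walk the vocab in original-id order and assign compact, contiguous new ids,
# mirroring the {'word': ..., 'token': ...} entry format shown above
for new_id, (word, orig_id) in enumerate(
        sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])):
    token_maps[orig_id] = {'word': word, 'token': new_id}

with open('token_maps.pkl', 'wb') as f:
    pickle.dump(token_maps, f)
```

If that is roughly right, then my question boils down to whether anything in PL-BERT actually breaks when these mapped entries are sub-word pieces rather than whole words.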