Tokenizer Card for Ansh-160k!
The tokenizer model Ansh-160k is trained on a dataset of Wikipedia articles in 18 Indic languages and English. We propose the name Ansh because this tokenizer is designed to meticulously identify every essential token (Ansh in Sanskrit) of our diverse Indic languages.
Model Description
India is a vast, multilingual country with 22 official languages and more than 1,700 languages and dialects. It has been observed that these languages share words among themselves, sometimes even across language families. To capitalize on this observation, we trained our tokenization model with a vocabulary size of 160,000 (160k) on a dataset of Wikipedia articles in 18 Indic languages and English using the Byte-Pair Encoding (BPE) algorithm. When compared with IndicBERTv2 (vocabulary size of 250,000) on fertility scores, our model outperforms it in almost every language, with significant improvements in Sanskrit (sa), Kashmiri (ks), Sindhi (sd), and Konkani (gom).
- Developed by: Lingo Research Group at IIT Gandhinagar
- Language(s) (NLP): Multilingual (18 Indic Languages and English)
- License: Apache 2.0
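For illustration, the sketch below shows how a BPE tokenizer of this kind could be trained with the Hugging Face `tokenizers` library. The corpus file paths, pre-tokenizer, and special tokens are assumptions made for the example, not the exact configuration used to build Ansh-160k.

```python
# Minimal sketch of BPE tokenizer training with the Hugging Face `tokenizers`
# library. The corpus files, pre-tokenizer, and special tokens below are
# illustrative assumptions, not the exact Ansh-160k training setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=160_000,  # matches the 160k vocabulary described above
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],  # assumed set
)

# Hypothetical plain-text Wikipedia dumps, one file per language
# (18 Indic languages plus English).
corpus_files = ["data/wiki_hi.txt", "data/wiki_ta.txt", "data/wiki_en.txt"]
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("ansh-160k-sketch.json")
```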
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
try:
    tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/Ansh-160k")
    print("Tokenizer loaded successfully!")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    print("Please ensure you have the correct model name and are connected to the internet.")
    exit()

input_text = "Hello, world! This is an example of how to use the tokenizer."
# input_text = 'मुझे यह presentation कल morning तक submit करना है।'
# input_text = 'What is the capital city of India?'

# Encode the text into token IDs and decode them back to a string.
encoded_input = tokenizer.encode(input_text)
print("\nOriginal Text:", input_text)
print("Encoded (Token IDs):", encoded_input)

decoded_output = tokenizer.decode(encoded_input)
print("Decoded Text:", decoded_output)
```
Evaluation
Tokenizers are compared using fertility scores, i.e., the average number of subword tokens produced per word; lower values indicate more efficient tokenization. Ansh-160k is evaluated against IndicBERTv2 across the 18 Indic languages and English listed below.
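As a rough illustration, fertility can be estimated as in the sketch below. The whitespace word segmentation and the sample sentence are simplifying assumptions; the exact evaluation corpus behind the reported numbers is not specified here.

```python
from transformers import AutoTokenizer

# Sketch of a fertility computation: average number of subword tokens per
# whitespace-separated word. Whitespace splitting and the sample text are
# simplifying assumptions; the official evaluation setup may differ.
def fertility(tokenizer, texts):
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenizer.tokenize(text))
    return total_tokens / max(total_words, 1)

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/Ansh-160k")
sample = ["भारत एक विशाल देश है।"]  # "India is a vast country."
print("Fertility:", round(fertility(tokenizer, sample), 3))
```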
Results
Fertility scores per language (lower is better):
Language | IndicBERTv2 | Ansh-160k |
---|---|---|
Tamil | 1.966 | 1.937 |
Kannada | 2.035 | 1.876 |
Malayalam | 2.202 | 2.073 |
Maithili | 1.534 | 1.270 |
Konkani | 2.145 | 1.741 |
Telugu | 1.803 | 1.713 |
Odia | 1.601 | 1.397 |
Bengali | 1.610 | 1.515 |
Nepali | 1.629 | 1.466 |
Punjabi | 1.458 | 1.445 |
Urdu | 1.565 | 1.383 |
Hindi | 1.456 | 1.364 |
Gujarati | 1.505 | 1.387 |
Kashmiri | 2.722 | 1.528 |
Marathi | 1.529 | 1.494 |
Sindhi | 1.740 | 1.380 |
Assamese | 1.677 | 1.562 |
Sanskrit | 2.821 | 1.950 |
English | 1.491 | 1.521 |
Model Card Contact
Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in