---
language: sa
---

# RoBERTa trained on Sanskrit (SanBERTa)

**Model size** (after training): **340 MB**

### Dataset:

[Wikipedia articles](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles) (used in [iNLTK](https://github.com/goru001/nlp-for-sanskrit)). The dataset includes an evaluation set.

[Sanskrit scraps from CLTK](http://cltk.org/)

### Configuration

| Parameter | Value |
|---|---|
| `num_attention_heads` | 12 |
| `num_hidden_layers` | 6 |
| `hidden_size` | 768 |
| `vocab_size` | 29407 |
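
A quick way to sanity-check these values is to load the published config from the Hub; a minimal sketch using the standard `transformers` config API:

```
from transformers import AutoConfig

# Pull the config from the Hub and print the values from the table above.
config = AutoConfig.from_pretrained("surajp/SanBERTa")
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 6
print(config.hidden_size)          # 768
print(config.vocab_size)           # 29407
```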

### Training:

- On TPU
- For language modelling
- Iteratively increasing `--block_size` from 128 to 256 over epochs (see the sketch after this list)
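
The card does not give the exact training script or hyperparameters. Purely as an illustration, a masked-LM setup with the `transformers` utilities might look like the following; `train.txt`, the epoch count, and the output directory are placeholders, and only `block_size` and the model dimensions come from the card:

```
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Model dimensions taken from the configuration table above.
config = RobertaConfig(
    vocab_size=29407,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_size=768,
)
model = RobertaForMaskedLM(config)
tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")

# Start of the block-size schedule; later epochs would rebuild the
# dataset with block_size=256, per the card.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./SanBERTa", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()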

### Evaluation

| Metric | Value |
|---|---|
| Perplexity (`block_size=256`) | 4.04 |
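
Perplexity here is the usual exponentiated mean cross-entropy on the evaluation set; the loss value below is back-computed from the reported number, not taken from the card:

```
import math

eval_loss = 1.396                 # hypothetical: ln(4.04), not reported in the card
perplexity = math.exp(eval_loss)  # ≈ 4.04
```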

## Example of usage:

### For Embeddings

```
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
model = RobertaModel.from_pretrained("surajp/SanBERTa")

op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
ps = model(op)
ps[0].shape  # last hidden states: (batch_size, sequence_length, hidden_size)
```

```
Output:
--------
torch.Size([1, 47, 768])
```
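
The call above returns one 768-dimensional vector per token. To get a single sentence-level embedding, one common option (not specified by the card) is to mean-pool the token states, reusing `model` and `op` from the snippet above:

```
import torch

with torch.no_grad():
    ps = model(op)
# Average over the 47 token positions to get one vector per sentence.
sentence_embedding = ps[0].mean(dim=1)  # shape: (1, 768)
```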

### For \<mask\> Prediction

```
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="surajp/SanBERTa",
    tokenizer="surajp/SanBERTa"
)

# Target sentence: इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
```

```
Output:
--------
[{'score': 0.7516744136810303,
  'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 280,
  'token_str': 'à¤Ĥ'},
 {'score': 0.06230105459690094,
  'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 289,
  'token_str': 'à¥Ģ'},
 {'score': 0.055410224944353104,
  'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
  'token': 265,
  'token_str': 'ा'},
 ...]
```
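
The odd-looking `token_str` values are an artifact of RoBERTa's byte-level BPE, which prints raw UTF-8 bytes through a remapped character set; decoding the token id recovers the actual character. A small check, using id 280 from the top prediction above:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
# Token 280 should render as the anusvara 'ं' that completes केवल -> केवलं above.
print(tokenizer.decode([280]))
```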

### It works!! 🎉 🎉 🎉

> Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)

> Made with <span style="color: #e25555;">♥</span> in India