julien-c HF staff committed on
Commit
058adc6
1 Parent(s): feca4a8

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/surajp/SanBERTa/README.md

Files changed (1)
  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
1
+ ---
2
+ language: sa
3
+ ---
4
+
5
+ # RoBERTa trained on Sanskrit (SanBERTa)
6
+
7
+ **Model size** (after training): **340MB**
8
+
9
+ ### Dataset:
10
+
11
+ [Wikipedia articles](https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles) (used in [iNLTK](https://github.com/goru001/nlp-for-sanskrit)).
12
+ It includes an evaluation set.
13
+
14
+ [Sanskrit scraps from CLTK](http://cltk.org/)
15
+
16
+ ### Configuration
17
+
18
+ | Parameter | Value |
19
+ |---|---|
20
+ | `num_attention_heads` | 12 |
21
+ | `num_hidden_layers` | 6 |
22
+ | `hidden_size` | 768 |
23
+ | `vocab_size` | 29407 |
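
The table above can be written as a `RobertaConfig`. This is a minimal sketch, assuming the `transformers` library is installed. Only the four listed values are set; every other field keeps the library default, which may differ from the released checkpoint.

```python
from transformers import RobertaConfig

# Only the four values from the configuration table are set here.
# All other fields are library defaults (an assumption, not taken from the card).
config = RobertaConfig(
    num_attention_heads=12,
    num_hidden_layers=6,
    hidden_size=768,
    vocab_size=29407,
)

# Each attention head operates on hidden_size / num_attention_heads dimensions.
head_dim = config.hidden_size // config.num_attention_heads
print(head_dim)  # 64
```

A `hidden_size` of 768 also matches the last dimension of the embedding output shown in the usage example.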
24
+
25
+ ### Training
26
+ - On TPU
27
+ - For language modelling
28
+ - Iteratively increasing `--block_size` from 128 to 256 over epochs
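
To illustrate what a larger `--block_size` changes, the sketch below splits a token stream into fixed-length blocks at two sizes. The helper `chunk_into_blocks` and the two-stage schedule `(128, 256)` are hypothetical; the card does not say how many epochs were run at each size or how the training script chunked the corpus.

```python
from typing import List


def chunk_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split a token stream into consecutive blocks of exactly block_size tokens.

    A trailing remainder shorter than block_size is dropped.
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Stand-in for a tokenized corpus (hypothetical data).
token_ids = list(range(1000))

# Hypothetical two-stage schedule: shorter blocks first, longer blocks later.
block_counts = {}
for block_size in (128, 256):
    blocks = chunk_into_blocks(token_ids, block_size)
    block_counts[block_size] = len(blocks)
    print(block_size, len(blocks))
```

Larger blocks give the model more context per example but fewer examples per pass over the same corpus.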
29
+
30
+ ### Evaluation
31
+
32
+ | Metric | Value |
33
+ |---|---|
34
+ |Perplexity (`block_size=256`)|4.04|
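
Perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship and derives the loss implied by the reported value. The implied loss is computed here, not reported in the card, and it assumes the loss is measured in nats.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood given in nats."""
    return math.exp(mean_nll)


# Loss implied by the reported perplexity of 4.04 (derived, not reported).
implied_loss = math.log(4.04)
print(round(implied_loss, 3))  # 1.396
```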
35
+
36
+ ## Example of usage:
37
+
38
+ ### For Embeddings
39
+
40
+ ```
41
+ from transformers import AutoTokenizer, RobertaModel
42
+ tokenizer = AutoTokenizer.from_pretrained("surajp/SanBERTa")
43
+ model = RobertaModel.from_pretrained("surajp/SanBERTa")
44
+
45
+ op = tokenizer.encode("इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।", return_tensors="pt")
46
+ ps = model(op)
47
+ ps[0].shape
48
+
49
+ ```
50
+ ```
51
+ '''
52
+ Output:
53
+ --------
54
+ torch.Size([1, 47, 768])
55
+
56
+ ```
57
+
58
+
59
+ ### For \<mask\> Prediction
60
+
61
+ ```
62
+ from transformers import pipeline
63
+
64
+ fill_mask = pipeline(
65
+ "fill-mask",
66
+ model="surajp/SanBERTa",
67
+ tokenizer="surajp/SanBERTa"
68
+ )
69
+
70
+ ## इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।
71
+ fill_mask("इयं भाषा न केवल<mask> भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।")
72
+
73
+ # The call above returns a list of candidate fills for <mask>, each a dict
74
+ # with `score`, `sequence`, `token`, and `token_str` keys.
75
+ ```
76
+ ```
77
+ '''
78
+ Output:
79
+ --------
80
+ [{'score': 0.7516744136810303,
81
+ 'sequence': '<s> इयं भाषा न केवलं भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
82
+ 'token': 280,
83
+ 'token_str': 'à¤Ĥ'},
84
+ {'score': 0.06230105459690094,
85
+ 'sequence': '<s> इयं भाषा न केवली भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
86
+ 'token': 289,
87
+ 'token_str': 'à¥Ģ'},
88
+ {'score': 0.055410224944353104,
89
+ 'sequence': '<s> इयं भाषा न केवला भारतस्य अपि तु विश्वस्य प्राचीनतमा भाषा इति मन्यते।</s>',
90
+ 'token': 265,
91
+ 'token_str': 'ा'},
92
+ ...]
93
+ ```
94
+
95
+ ### It works!! 🎉 🎉 🎉
96
+
97
+ > Created by [Suraj Parmar/@parmarsuraj99](https://twitter.com/parmarsuraj99) | [LinkedIn](https://www.linkedin.com/in/parmarsuraj99/)
98
+
99
+ > Made with <span style="color: #e25555;">&hearts;</span> in India