julien-c committed
Commit 1304b73
1 Parent(s): 69e45c9

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/aliosm/ai-soco-cpp-roberta-small/README.md

Files changed (1)
  1. README.md +55 -0

README.md ADDED
@@ -0,0 +1,55 @@
---
language: "c++"
tags:
- exbert
- authorship-identification
- fire2020
- pan2020
- ai-soco
license: "mit"
datasets:
- ai-soco
metrics:
- perplexity
---

# ai-soco-c++-roberta-small

## Model description

A RoBERTa model pre-trained from scratch, with 6 layers and 12 attention heads, on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the Codeforces website.

## Intended uses & limitations

The model can be used for code classification, authorship identification and other downstream tasks on the C++ programming language.
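
The sketch below is not part of the original card; it only illustrates how such a downstream task could be set up by attaching a sequence-classification head with the `transformers` library. The model ID is inferred from the repository path and `num_labels` is a placeholder for the number of candidate authors.

```python
# Hypothetical sketch: loading the checkpoint with a classification head for
# a downstream task such as authorship identification. The model ID is
# inferred from the repository path; num_labels is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "aliosm/ai-soco-cpp-roberta-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1000)

code = "int main() {\n\tint a, b;\n\tstd::cin >> a >> b;\n\tstd::cout << a + b;\n}"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Before fine-tuning, this prediction is meaningless; the head must first be
# trained on labelled (code, author) pairs, e.g. with the Trainer API.
print(logits.argmax(dim=-1).item())
```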

#### How to use

You can use the model directly after tokenizing the text with the tokenizer provided alongside the model files.
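
A minimal usage sketch follows; the model ID is an assumption inferred from the repository path rather than something stated in the card itself.

```python
# Minimal usage sketch: tokenize C++ code and extract contextual embeddings.
# The model ID below is inferred from the repository path (an assumption).
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "aliosm/ai-soco-cpp-roberta-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

code = '#include <iostream>\nint main() {\n\tstd::cout << "hi";\n\treturn 0;\n}'
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings from the final layer.
print(outputs.last_hidden_state.shape)
```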

#### Limitations and bias

The model is limited to the C++ programming language only.

## Training data

The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source codes.

## Training procedure

The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs, a 16\*8 batch size, a 512 max sequence length and the MLM objective. Other parameters were left at the default values used in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Every run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization.
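
The exact preprocessing script is not included in the card; the following is a minimal sketch of the described space-to-tab conversion, assuming a plain string replacement is sufficient.

```python
# Sketch of the preprocessing described above: every run of 4 consecutive
# spaces is collapsed into a single tab character before tokenization.
def spaces_to_tabs(source_code: str) -> str:
    return source_code.replace("    ", "\t")

example = "int main() {\n    int x = 0;\n        return x;\n}"
print(spaces_to_tabs(example))  # 4-space runs become \t, 8-space runs become \t\t
```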

### BibTeX entry and citation info

```bibtex
@inproceedings{ai-soco-2020-fire,
    title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}",
    author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo",
    booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)",
    year = "2020"
}
```

<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-small">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>