---
language:
- tr
tags:
- roberta
license: cc-by-nc-sa-4.0
---

# RoBERTweetTurkCovid (uncased)

Pretrained model on the Turkish language using a masked language modeling (MLM) objective. The model is uncased.
The pretraining corpus is a collection of Turkish tweets related to COVID-19. The details of the data can be found in this paper:
https://arxiv.org/...

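As a quick illustration of the MLM objective, the sketch below masks one token and asks the model for replacements. This is a minimal sketch, assuming the published checkpoint bundles an MLM head and tokenizer files; `model_path` is a placeholder and the Turkish sentence is an invented example:

```python
from transformers import pipeline

# model_path is a placeholder for this model's checkpoint directory or Hub id;
# the Turkish sentence is an invented example, not from the paper.
model_path = "path/to/RoBERTweetTurkCovid"
fill_mask = pipeline("fill-mask", model=model_path)
print(fill_mask("aşı olmak çok [MASK]"))  # top predictions for the masked token
```
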
The model architecture is similar to RoBERTa-base (12 layers, 12 attention heads, and a hidden size of 768). The tokenization algorithm is WordPiece, and the vocabulary size is 30k.

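For reference, these dimensions can be expressed as a `transformers` configuration. This is a rough sketch built only from the numbers above; fields not listed fall back to `RobertaConfig` defaults and may differ from the released checkpoint:

```python
from transformers import RobertaConfig

# Sketch assembled from the figures quoted above; unlisted fields are
# library defaults and may not match the released checkpoint.
config = RobertaConfig(
    vocab_size=30_000,       # vocabulary size is 30k
    hidden_size=768,         # 768 hidden size
    num_hidden_layers=12,    # 12 layers
    num_attention_heads=12,  # 12 attention heads
)
```
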
The details of pretraining can be found in this paper:
https://arxiv.org/...

The following code can be used for model loading and tokenization; the example max length (768) can be changed. `model_path`, `file_path`, and `num_classes` are placeholders to fill in:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, PreTrainedTokenizerFast

# model_path is a placeholder: the checkpoint directory or Hub id.
model = AutoModel.from_pretrained(model_path)
# for sequence classification:
# model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)

# file_path is a placeholder: the tokenizer JSON file shipped with the model.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=file_path)
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 768
```
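As a hypothetical usage example (the Turkish sentence is invented), the model and tokenizer loaded above can then encode text and expose the [CLS] representation:

```python
import torch

# Encode an invented example with the tokenizer built above and run the model.
inputs = tokenizer("maske takmak önemli", return_tensors="pt",
                   truncation=True, max_length=tokenizer.model_max_length)
with torch.no_grad():
    outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # [CLS] vector, shape (1, 768)
```
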
### BibTeX entry and citation info

```bibtex
@article{}
```