julien-c (HF staff) committed on
Commit 83894f7 · 1 Parent(s): e979988

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/ethanyt/guwenbert-large/README.md

Files changed (1)
  1. README.md +74 -0
README.md ADDED
@@ -0,0 +1,74 @@
1
+ ---
2
+ language:
3
+ - "zh"
4
+ thumbnail: "https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png"
5
+ tags:
6
+ - "chinese"
7
+ - "classical chinese"
8
+ - "literary chinese"
9
+ - "ancient chinese"
10
+ - "bert"
11
+ - "pytorch"
12
+ license: "apache-2.0"
13
+ pipeline_tag: "fill-mask"
14
+ widget:
15
+ - text: "[MASK]太元中,武陵人捕鱼为业。"
16
+ - text: "问征夫以前路,恨晨光之[MASK]微。"
17
+ - text: "浔阳江头夜送客,枫叶[MASK]花秋瑟瑟。"
18
+ ---
19
+
20
+ # GuwenBERT
21
+
22
+ ## Model description
23
+ ![GuwenBERT](https://user-images.githubusercontent.com/9592150/97142000-cad08e00-179a-11eb-88df-aff9221482d8.png)
24
+
25
+ This is a RoBERTa model pre-trained on Classical Chinese. You can fine-tune GuwenBERT for downstream tasks, such as sentence breaking, punctuation, named entity recognition, and so on.
26
+
27
+ For more information about RoBERTa, see the official RoBERTa repository.
28
+
29
+ ## How to use
30
+
31
+ ```python
32
+ from transformers import AutoTokenizer, AutoModel
33
+
34
+ tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-large")
35
+
36
+ model = AutoModel.from_pretrained("ethanyt/guwenbert-large")
37
+ ```
38
+
39
+ ## Training data
40
+
41
+ The training data is the Daizhige dataset (殆知阁古代文献), which contains 15,694 books in Classical Chinese, covering Buddhism, Confucianism, Medicine, History, Zi, Yi, Yizang, Shizang, Taoism, and Jizang.
42
+ 76% of them are punctuated.
43
+ The total number of characters is 1.7B (1,743,337,673).
44
+ All traditional characters are converted to simplified characters.
45
+ The vocabulary is built from this dataset and has 23,292 entries.
46
+
47
+ ## Training procedure
48
+
49
+ The models are initialized from `hfl/chinese-roberta-wwm-ext-large` and then pre-trained with a two-step strategy.
50
+ In the first step, the model is trained on masked language modeling (MLM) with only the word embeddings updated, until convergence. In the second step, all parameters are updated.
51
+
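The two-step strategy described above can be sketched in PyTorch-style code. This is a minimal sketch, not the authors' training script: the helper name `set_trainable_for_stage` is hypothetical, and it assumes the word-embedding parameters carry `word_embeddings` in their names, as they do in the Transformers BERT and RoBERTa implementations. It works on any object that exposes `named_parameters()`, such as a `torch.nn.Module`.

```python
from typing import Any


def set_trainable_for_stage(model: Any, stage: int) -> int:
    """Toggle `requires_grad` for a training stage (hypothetical helper).

    Stage 1: only parameters whose name contains "word_embeddings" are trained.
    Stage 2: every parameter is trained.
    Returns the number of parameter tensors left trainable.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")

    trainable = 0
    for name, param in model.named_parameters():
        # Assumption: word embeddings are identified by parameter name.
        param.requires_grad = (stage == 2) or ("word_embeddings" in name)
        trainable += int(param.requires_grad)
    return trainable
```

A training loop would call `set_trainable_for_stage(model, 1)` before the first 20K steps and `set_trainable_for_stage(model, 2)` before the remaining 100K, building the optimizer over the trainable parameters each time.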
52
+ The models are trained on 4 V100 GPUs for 120K steps (20K for step 1, 100K for step 2) with a batch size of 2,048 and a sequence length of 512. The optimizer is Adam with a learning rate of 1e-4, betas of (0.9, 0.98), an epsilon of 1e-6, and a weight decay of 0.01. The learning rate warms up for 5K steps and then decays linearly.
53
+
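The learning-rate schedule in this paragraph can be illustrated with a small function. This is a sketch under stated assumptions rather than the authors' configuration: it assumes the warmup is linear from zero, that a single schedule spans all 120K steps (instead of restarting for each step of the strategy), and that the rate decays to zero at the final step. None of these details are given in the model card.

```python
PEAK_LR = 1e-4          # learning rate from the model card
WARMUP_STEPS = 5_000    # warmup length from the model card
TOTAL_STEPS = 120_000   # 20K + 100K steps from the model card


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay (assumed to reach zero)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        # Assumption: warmup starts at zero and rises linearly.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: decay ends at zero exactly at TOTAL_STEPS.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)
```

Under these assumptions the rate peaks at step 5,000 and is back to half the peak at step 62,500.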
54
+ ## Eval results
55
+
56
+ ### "Gulian Cup" Ancient Books Named Entity Recognition Evaluation
57
+
58
+ Second place in the competition. Detailed test results:
59
+
60
+ | NE Type | Precision | Recall | F1 |
61
+ |:----------:|:-----------:|:------:|:-----:|
62
+ | Book Name | 77.50 | 73.73 | 75.57 |
63
+ | Other Name | 85.85 | 89.32 | 87.55 |
64
+ | Micro Avg. | 83.88 | 85.39 | 84.63 |
65
+
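As a consistency check on the table above, each F1 value should be the harmonic mean of the precision and recall in its row. The short sketch below recomputes it from the reported figures; the function name `f1_from_pr` is ours and is not part of the evaluation code.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table, in percent.
rows = {
    "Book Name": (77.50, 73.73),
    "Other Name": (85.85, 89.32),
    "Micro Avg.": (83.88, 85.39),
}

for label, (p, r) in rows.items():
    print(f"{label}: F1 = {f1_from_pr(p, r):.2f}")
```

The recomputed values round to 75.57, 87.55, and 84.63, matching the F1 column.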
66
+
67
+
68
+
69
+ ## About Us
70
+
71
+ We are from [Datahammer](https://datahammer.net), Beijing Institute of Technology.
72
+ For collaboration inquiries, please email ethanyt [at] qq.com
73
+
74
+ > Created with ❤️ by Tan Yan [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/Ethan-yt) and Zewen Chi [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/CZWin32768)