kimsiun committed
Commit
d11dfd9
1 Parent(s): bc26505

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,73 @@
 ---
+language: ko
+tags:
+- bert
+- korean-english
+- clinical nlp
+- pharmacovigilance
+- adverse events
 license: mit
 ---
+
+# KAERS-BERT
+
+## Model Description
+
+KAERS-BERT is a domain-specific Korean BERT model for clinical text analysis, particularly for processing adverse drug event (ADE) narratives. It was built by further pretraining KoBERT (developed by SK Telecom) on 1.2 million ADE narratives reported through the Korea Adverse Event Reporting System (KAERS) between January 2015 and December 2019.
+
+The model is designed for clinical texts in which code-switching between Korean and English is frequent, making it particularly effective at processing medical terms and abbreviations in a bilingual context.
+
+## Key Features
+
+- Specialized in clinical and pharmaceutical domain text
+- Handles the Korean-English code-switching common in medical texts
+- Optimized for processing adverse drug event narratives
+- Built on the KoBERT architecture with domain-specific pretraining
+
+## Training Data
+
+The model was pretrained on:
+- 1.2 million ADE narratives from KAERS
+- Narratives drawn from the 'disease history in detail' and 'adverse event in detail' sections
+- Masked language modeling with a 15% token masking rate
+- Maximum sequence length of 200
+- Learning rate of 5×10^-5
+
+## Performance
+
+The model demonstrated strong performance on NLP tasks related to drug safety information extraction:
+- Named Entity Recognition (NER): 83.81% F1-score
+- Sentence Extraction: 76.62% F1-score
+- Relation Extraction: 64.37% F1-score (weighted)
+- Label Classification:
+  - 'Occurred' label: 81.33% F1-score
+  - 'Concerned' label: 77.62% F1-score
+
+When applied to the KAERS database, the model increased the completeness of structured data fields by an average of 3.24%.
+
+## Intended Use
+
+This model is designed for:
+- Extracting drug safety information from clinical narratives
+- Processing Korean medical texts with English medical terminology
+- Supporting pharmacovigilance activities
+- Improving data quality in adverse event reporting systems
+
+## Limitations
+
+- The model is trained specifically on adverse event narratives and may not generalize well to other clinical domains
+- Performance may vary on texts that differ significantly from KAERS narratives
+- The model works best on Korean clinical texts containing English medical terminology
+
+## Citation
+
+```bibtex
+@article{kim2023automatic,
+  title={Automatic Extraction of Comprehensive Drug Safety Information from Adverse Drug Event Narratives in the Korea Adverse Event Reporting System Using Natural Language Processing Techniques},
+  author={Kim, Siun and Kang, Taegwan and Chung, Tae Kyu and Choi, Yoona and Hong, YeSol and Jung, Kyomin and Lee, Howard},
+  journal={Drug Safety},
+  volume={46},
+  pages={781--795},
+  year={2023}
+}
+```
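
The card above ships without a usage snippet; a minimal loading sketch follows. The repo id (`kimsiun/kaers-bert`), the availability of the `kobert_tokenizer` package from the SKTBrain/KoBERT project, and the example sentence are assumptions for illustration, not part of this commit.

```python
# Minimal masked-LM sketch. ASSUMPTIONS: repo id "kimsiun/kaers-bert" and the
# kobert_tokenizer package (SKTBrain/KoBERT) being installed.
import torch
from transformers import AutoModelForMaskedLM
from kobert_tokenizer import KoBERTTokenizer  # assumed external dependency

repo_id = "kimsiun/kaers-bert"  # assumed repo id
tokenizer = KoBERTTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
model.eval()

# A short Korean-English code-switched clinical sentence with one masked token.
text = "환자는 aspirin 투여 후 [MASK] 증상을 보였다."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the top prediction at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
print(tokenizer.decode(logits[mask_pos].argmax(dim=-1)))
```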
config.json ADDED
@@ -0,0 +1,26 @@
+{
+  "_name_or_path": "monologg/kobert",
+  "architectures": [
+    "BertForPreTraining"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.31.0.dev0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 8002
+}
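
The `architectures` entry (`BertForPreTraining`) is consistent with the masked-language-modeling recipe described in the README (15% masking, maximum length 200, learning rate 5×10^-5). A sketch of that setup with the Hugging Face `Trainer` follows; the toy corpus, batch size, epoch count, and tokenizer loading path are assumptions, not the authors' code.

```python
# Sketch of the MLM pretraining recipe from the model card. Only the 15% masking
# rate, max length 200, and learning rate 5e-5 come from the card; everything
# else here (corpus, batch size, epochs, tokenizer loading) is an assumption.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "monologg/kobert"  # the base checkpoint named in config.json
# Illustrative only: KoBERT tokenizers have historically shipped outside
# transformers core, so this loading path may differ in practice.
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(base)

# Toy stand-in for the 1.2M KAERS narratives, truncated to the card's max length.
texts = ["환자는 약물 투여 후 rash 발생.", "복용 중단 후 증상 호전됨."]
train_ds = Dataset.from_dict(dict(tokenizer(texts, truncation=True, max_length=200)))

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% token masking rate (model card)
)

args = TrainingArguments(
    output_dir="kaers-bert-mlm",
    learning_rate=5e-5,              # model card
    per_device_train_batch_size=2,   # assumed
    num_train_epochs=1,              # assumed
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_ds).train()
```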
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b00afd760a17c718872dffa81038a8f708672e4e132de711bf7b27d41e43cbea
+size 371225993
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+{
+  "bos_token": "[CLS]",
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
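
These special tokens should line up with `"pad_token_id": 1` in config.json above; a quick consistency check, under the same assumed repo id and tokenizer package as earlier:

```python
# Print the special tokens and compare the pad id against config.json
# ("pad_token_id": 1). Repo id and tokenizer package are assumed, as above.
from transformers import AutoConfig
from kobert_tokenizer import KoBERTTokenizer

repo_id = "kimsiun/kaers-bert"  # assumed
tok = KoBERTTokenizer.from_pretrained(repo_id)
cfg = AutoConfig.from_pretrained(repo_id)

print(tok.cls_token, tok.sep_token, tok.mask_token, tok.pad_token, tok.unk_token)
print(tok.pad_token_id, cfg.pad_token_id)  # both expected to be 1
```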
spiece.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:17dc471055592d3cc9e0a5831e769246a8a001a4d27551c9ed79668173c7b407
+size 371427
tokenizer_config.json ADDED
@@ -0,0 +1,24 @@
+{
+  "additional_special_tokens": null,
+  "bos_token": "[CLS]",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "eos_token": "[SEP]",
+  "keep_accents": false,
+  "mask_token": {
+    "__type": "AddedToken",
+    "content": "[MASK]",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "remove_space": true,
+  "sep_token": "[SEP]",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "KoBERTTokenizer",
+  "unk_token": "[UNK]"
+}
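
Note that `model_max_length` is left at the library's "unset" sentinel (the huge integer), so the tokenizer will not truncate by default, even though `max_position_embeddings` is 512 and the card reports pretraining at length 200. One way to pin it down, under the same assumptions as the earlier snippets:

```python
# The sentinel model_max_length means "no limit configured"; cap inputs at the
# card's pretraining length of 200 tokens. Repo id and tokenizer package assumed.
from kobert_tokenizer import KoBERTTokenizer

tok = KoBERTTokenizer.from_pretrained("kimsiun/kaers-bert")  # assumed repo id
tok.model_max_length = 200
enc = tok("환자는 cefaclor 복용 후 두드러기 발생", truncation=True)
print(len(enc["input_ids"]))  # at most 200
```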