Upload folder using huggingface_hub
- README.md +70 -0
- config.json +26 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +15 -0
- spiece.model +3 -0
- tokenizer_config.json +24 -0
README.md
CHANGED
---
language: ko
tags:
- bert
- korean-english
- clinical nlp
- pharmacovigilance
- adverse events
license: mit
---

# KAERS-BERT
## Model Description

KAERS-BERT is a domain-specific Korean BERT model specialized for clinical text analysis, particularly for processing adverse drug event (ADE) narratives. It was developed by further pretraining KoBERT (developed by SK Telecom) on 1.2 million ADE narratives reported through the Korea Adverse Event Reporting System (KAERS) between January 2015 and December 2019.

The model is designed to handle clinical texts in which code-switching between Korean and English is frequent, making it particularly effective for processing medical terms and abbreviations in a bilingual context.
## Key Features

- Specialized for clinical and pharmaceutical domain text
- Handles the Korean-English code-switching common in medical texts
- Optimized for processing adverse drug event narratives
- Built on the KoBERT architecture with domain-specific pretraining
## Training Data

The model was pretrained on:
- 1.2 million ADE narratives from KAERS
- Narrative text drawn specifically from the 'disease history in detail' and 'adverse event in detail' sections of reports

Pretraining setup:
- Masked language modeling with a 15% token masking rate
- Maximum sequence length of 200
- Learning rate: 5×10^-5
## Performance

The model demonstrated strong performance on NLP tasks related to drug safety information extraction:

- Named Entity Recognition (NER): 83.81% F1-score
- Sentence Extraction: 76.62% F1-score
- Relation Extraction: 64.37% F1-score (weighted)
- Label Classification:
  - 'Occurred' label: 81.33% F1-score
  - 'Concerned' label: 77.62% F1-score

When applied to the KAERS database, the model achieved an average increase of 3.24% in data completeness for structured data fields.
## Intended Use

This model is designed for:
- Extracting drug safety information from clinical narratives
- Processing Korean medical texts with English medical terminology
- Supporting pharmacovigilance activities
- Improving data quality in adverse event reporting systems
## Limitations

- The model is specifically trained on adverse event narratives and may not generalize well to other clinical domains
- Performance may vary on texts that differ substantially from KAERS narratives
- The model works best with Korean clinical texts containing English medical terminology
## Citation

```bibtex
@article{kim2023automatic,
  title={Automatic Extraction of Comprehensive Drug Safety Information from Adverse Drug Event Narratives in the Korea Adverse Event Reporting System Using Natural Language Processing Techniques},
  author={Kim, Siun and Kang, Taegwan and Chung, Tae Kyu and Choi, Yoona and Hong, YeSol and Jung, Kyomin and Lee, Howard},
  journal={Drug Safety},
  volume={46},
  pages={781--795},
  year={2023}
}
```
config.json
ADDED
{
  "_name_or_path": "monologg/kobert",
  "architectures": [
    "BertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.31.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 8002
}
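As a sanity check, the config above implies roughly 93M parameters for `BertForPreTraining`, which at float32 (4 bytes each) is consistent with the ~371 MB checkpoint below. A rough count, assuming the standard BERT parameter layout (the small remaining gap versus the file size is serialization overhead):

```python
# Approximate parameter count for BertForPreTraining from the config:
# vocab_size, hidden_size, num_hidden_layers, intermediate_size,
# max_position_embeddings, type_vocab_size.
V, H, L, I, P, T = 8002, 768, 12, 3072, 512, 2

embeddings = V*H + P*H + T*H + 2*H      # word/position/type embeddings + LayerNorm
per_layer = (3*(H*H + H)                # Q, K, V projections
             + H*H + H + 2*H            # attention output dense + LayerNorm
             + H*I + I                  # feed-forward intermediate dense
             + I*H + H + 2*H)           # feed-forward output dense + LayerNorm
encoder = L * per_layer
pooler = H*H + H
heads = (H*H + H + 2*H + V              # MLM transform + LayerNorm + decoder bias
         + H*2 + 2)                     # next-sentence-prediction classifier
total = embeddings + encoder + pooler + heads
print(total, total * 4)                 # parameter count and float32 bytes
```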
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b00afd760a17c718872dffa81038a8f708672e4e132de711bf7b27d41e43cbea
size 371225993
special_tokens_map.json
ADDED
{
  "bos_token": "[CLS]",
  "cls_token": "[CLS]",
  "eos_token": "[SEP]",
  "mask_token": {
    "content": "[MASK]",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
spiece.model
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:17dc471055592d3cc9e0a5831e769246a8a001a4d27551c9ed79668173c7b407
size 371427
tokenizer_config.json
ADDED
{
  "additional_special_tokens": null,
  "bos_token": "[CLS]",
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "eos_token": "[SEP]",
  "keep_accents": false,
  "mask_token": {
    "__type": "AddedToken",
    "content": "[MASK]",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "[PAD]",
  "remove_space": true,
  "sep_token": "[SEP]",
  "sp_model_kwargs": {},
  "tokenizer_class": "KoBERTTokenizer",
  "unk_token": "[UNK]"
}