---
license: afl-3.0
tags:
- chinese
- ner
- medical
---
# Chinese Named Entity Recognition for the Medical Domain

Project repository: https://github.com/iioSnail/chinese_medical_ner

Usage:
```
import torch
from transformers import AutoModelForTokenClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("iioSnail/bert-base-chinese-medical-ner")
model = AutoModelForTokenClassification.from_pretrained("iioSnail/bert-base-chinese-medical-ner")

sentences = ["瘦脸针、水光针和玻尿酸详解!", "半月板钙化的病因有哪些?"]
# No [CLS]/[SEP] tokens are added, so for these sentences each position maps to one character.
inputs = tokenizer(sentences, return_tensors="pt", padding=True, add_special_tokens=False)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# Pick the highest-scoring tag per token and zero out padded positions.
outputs = outputs.logits.argmax(-1) * inputs["attention_mask"]

print(outputs)
```

Output:

```
tensor([[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 4, 4],
        [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0]])
```
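
Here `1=B, 2=I, 3=E, 4=O`. The sequence `1, 3` marks a two-character medical entity, `1, 2, 3` a three-character entity, `1, 2, 2, 3` a four-character entity, and so on. The trailing `0`s in the second row are padded positions zeroed out by the attention mask.

To convert the output into entity spans, use `MedicalNerModel.format_outputs(sentences, outputs)` from the project.

The result looks like this:
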
```
[
  [
    {'start': 0, 'end': 3, 'word': '瘦脸针'},
    {'start': 4, 'end': 7, 'word': '水光针'},
    {'start': 8, 'end': 11, 'word': '玻尿酸'}
  ],
  [
    {'start': 0, 'end': 5, 'word': '半月板钙化'}
  ]
]
```
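If the project code is not at hand, the tag scheme can be decoded by hand. The following is a minimal sketch, not the project's `format_outputs` implementation: `decode_entities` is a hypothetical helper, and it assumes one token per character (which holds for the example sentences) and that an entity is always a `B`, zero or more `I`, then `E` run. The tag rows are copied from the output shown earlier; with a real tensor, `outputs.tolist()` would give the same lists.

```python
def decode_entities(sentence, tag_ids):
    """Collect B(1) I(2)* E(3) runs as entity spans; `end` is exclusive.

    Hypothetical helper for illustration; assumes one tag per character.
    """
    B, I, E = 1, 2, 3
    entities = []
    start = None  # index of the currently open entity, if any
    # Slice to the sentence length so padded positions (0) are ignored.
    for i, tag in enumerate(tag_ids[:len(sentence)]):
        if tag == B:
            start = i
        elif tag == I and start is not None:
            continue  # still inside the open entity
        elif tag == E and start is not None:
            entities.append(
                {'start': start, 'end': i + 1, 'word': sentence[start:i + 1]}
            )
            start = None
        else:
            start = None  # O, padding, or a malformed run: drop any open entity
    return entities


sentences = ["瘦脸针、水光针和玻尿酸详解!", "半月板钙化的病因有哪些?"]
tag_rows = [
    [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 4, 4],
    [1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0],
]
results = [decode_entities(s, row) for s, row in zip(sentences, tag_rows)]
print(results)
```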

For more information, see the project: https://github.com/iioSnail/chinese_medical_ner