MEDNER-de-fp-gbert / README.md

Update README.md

3098a9a verified 8 months ago

4.2 kB

	---
	license: agpl-3.0
	language:
	- de
	base_model:
	- deepset/gbert-base
	pipeline_tag: token-classification
	---

	# MEDNER.DE: Medicinal Product Entity Recognition in German-Specific Contexts

	Released in December 2024, this is a German BERT language model further pretrained on `deepset/gbert-base` using a pharmacovigilance-related case summary corpus. The model has been fine-tuned for Named Entity Recognition (NER) tasks on an automatically annotated dataset to recognize medicinal products such as medications and vaccines.
	In our paper, we outline the steps taken to train this model and demonstrate its superior performance compared to previous approaches


	---

	## Overview
	- Paper: [https://...
	- Architecture: MLM_based BERT Base
	- Language: German
	- Supported Labels: Medicinal Product


	Model Name: MEDNER.DE

	---

	## How to Use

	### Use a pipeline as a high-level helper
	```python
	from transformers import pipeline

	# Load the NER pipeline
	model = pipeline("ner", model="pei-germany/MEDNER-de-fp-gbert", aggregation_strategy="none")

	# Input text
	text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

	# Get raw predictions and merge subwords
	merged_predictions = []
	current = None

	for pred in model(text):
	if pred['word'].startswith("##"):
	if current:
	current['word'] += pred['word'][2:]
	current['end'] = pred['end']
	current['score'] = (current['score'] + pred['score']) / 2
	else:
	if current:
	merged_predictions.append(current)
	current = pred.copy()

	if current:
	merged_predictions.append(current)

	# Filter by confidence threshold and print
	threshold = 0.5
	filtered_predictions = [p for p in merged_predictions if p['score'] >= threshold]
	for p in filtered_predictions:
	print(f"Entity: {p['entity']}, Word: {p['word']}, Score: {p['score']:.2f}, Start: {p['start']}, End: {p['end']}")

	```


	### Load model directly
	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("pei-germany/MEDNER-de-fp-gbert")
	model = AutoModelForTokenClassification.from_pretrained("pei-germany/MEDNER-de-fp-gbert")

	text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

	# Tokenize and get predictions
	inputs = tokenizer(text, return_tensors="pt")
	outputs = model(**inputs)

	# Decode tokens and predictions
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
	labels = [model.config.id2label[pred] for pred in predictions]

	# Process and merge subwords
	entities = []
	current_word = ""
	current_entity = None

	for token, label in zip(tokens, labels):
	token = token.replace("##", "") # Remove subword markers

	if label.startswith("B-"): # Beginning of a new entity
	if current_entity and current_entity == label[2:]: # Merge consecutive B- labels
	current_word += token
	else: # Save the previous entity and start a new one
	if current_word:
	entities.append({"entity": current_entity, "word": current_word})
	current_word = token
	current_entity = label[2:]
	elif label.startswith("I-") and current_entity == label[2:]: # Continuation of the same entity
	current_word += token
	else: # Outside any entity
	if current_word: # Save the previous entity
	entities.append({"entity": current_entity, "word": current_word})
	current_word = ""
	current_entity = None

	if current_word: # Append the last entity
	entities.append({"entity": current_entity, "word": current_word})

	# Print results
	for entity in entities:
	print(f"Entity: {entity['entity']}, Word: {entity['word']}")

	```
	---
	# Authors
	Farnaz Zeidi, Manuela Messelhäußer, Roman Christof, Xing David Wang, Ulf Leser, Dirk Mentzer, Renate König, Liam Childs.


	---

	## License
	This model is shared under the [GNU Affero General Public License v3.0 License](https://choosealicense.com/licenses/agpl-3.0/).