pei-germany/MEDNER-de-fp-gbert


	---
	license: agpl-3.0
	language:
	- de
	base_model:
	- deepset/gbert-base
	pipeline_tag: token-classification
	---

	# NER Model for Legal Texts

	Released in January 2024, this is a Turkish BERT language model pretrained from scratch with an optimized BERT architecture on a 2 GB corpus of Turkish legal text. The corpus was drawn from law-related theses available in the Higher Education Board National Thesis Center (YÖKTEZ). The model was then fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by NewMind, a legal tech company in Istanbul, Turkey.

	Our paper describes the steps taken to train this model and reports that it outperforms previous approaches.

	---

	## Overview
	- Preprint Paper: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
	- Architecture: Optimized BERT Base
	- Language: Turkish
	- Supported Labels:
	  - `Person`
	  - `Law`
	  - `Publication`
	  - `Government`
	  - `Corporation`
	  - `Other`
	  - `Project`
	  - `Money`
	  - `Date`
	  - `Location`
	  - `Court`

	Model Name: LegalTurk Optimized BERT

	---

	## How to Use

	### Use a pipeline as a high-level helper
	```python
	from transformers import pipeline

	# Load the pipeline
	model = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy='simple')

	# Input text
	text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

	# Get predictions
	predictions = model(text)
	print(predictions)
	```
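
With `aggregation_strategy='simple'`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. As a minimal sketch of one way to post-process that output, the helper below groups entity spans by label and drops low-confidence ones. The function name `group_entities`, the `min_score` threshold, and the sample predictions are hypothetical and chosen for illustration; the actual label strings and scores depend on the model.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group aggregated NER predictions by label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in predictions:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical output in the shape produced with aggregation_strategy="simple";
# the labels and scores are illustrative, not real model predictions.
sample_predictions = [
    {"entity_group": "Law", "score": 0.98, "word": "Tebligat Kanunu", "start": 8, "end": 23},
    {"entity_group": "Law", "score": 0.41, "word": "VUK", "start": 28, "end": 31},
]

print(group_entities(sample_predictions))
```

In practice, `sample_predictions` would be replaced by the `predictions` list returned by the pipeline call above, and the threshold tuned to the application.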


	### Load model directly
	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	# Load model and tokenizer

	tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
	model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

	text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
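	inputs = tokenizer(text, return_tensors="pt")

	# Inference only, so gradient tracking is not needed
	with torch.no_grad():
	    outputs = model(**inputs)

	# Pick the highest-scoring label per token (argmax over logits; softmax would not change the result)
	label_ids = torch.argmax(outputs.logits, dim=-1)[0]
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

	# Map label ids to label names, skipping special tokens such as [CLS] and [SEP]
	predictions = [
	    (token, model.config.id2label[label_id.item()])
	    for token, label_id in zip(tokens, label_ids)
	    if token not in tokenizer.all_special_tokens
	]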

	print(predictions)
	```
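
Loading the model directly yields one prediction per subword token rather than per word. The sketch below shows one possible way to merge subword pieces back into words, assuming the tokenizer marks continuation pieces with a `##` prefix (the usual WordPiece convention for BERT tokenizers, not confirmed here for this model). The helper name `merge_wordpieces`, the choice to keep the first piece's label, and the sample tokens are hypothetical.

```python
def merge_wordpieces(token_label_pairs):
    """Merge WordPiece continuation tokens ("##...") into whole words.

    Each merged word keeps the label predicted for its first piece.
    """
    words = []
    for token, label in token_label_pairs:
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hypothetical token-level output; the subword split and labels are illustrative only.
sample = [
    ("Teb", "Law"),
    ("##ligat", "Law"),
    ("Kanunu", "Law"),
    ("ile", "O"),
]

print(merge_wordpieces(sample))
```

Keeping the first piece's label is a common, simple choice; averaging logits over the pieces of a word is an alternative when piece-level predictions disagree.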
	---
	## Authors
	Farnaz Zeidi, Mehmet Fatih Amasyali, Çigdem Erol

	---

	## License
	This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
	You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors and distribute any adaptations under the same license.

	For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).