---
license: cc-by-nc-sa-4.0
language:
- tr
pipeline_tag: token-classification
---
# NER Model for Legal Texts
Released in January 2024, this is a Turkish BERT language model pretrained from scratch, with an **optimized BERT architecture**, on a 2 GB Turkish legal corpus. The corpus was sourced from law-related theses available in the Higher Education Board National Thesis Center (YÖKTEZ). The model was then fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.
In our paper, we outline the steps taken to train this model and demonstrate its superior performance compared to previous approaches.
---
## Overview
- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels** (the snippet below this list shows how to read the exact tag set from the checkpoint):
- `Person`
- `Law`
- `Publication`
- `Government`
- `Corporation`
- `Other`
- `Project`
- `Money`
- `Date`
- `Location`
- `Court`
**Model Name**: LegalTurk Optimized BERT
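The list above gives the entity categories; the checkpoint itself stores the token-level tag names in its config. A minimal sketch for inspecting them, assuming the standard `id2label` field is populated (whether the tags carry `B-`/`I-` prefixes is something to verify against the actual config):

```python
from transformers import AutoConfig

# Read the label map shipped with the checkpoint.
config = AutoConfig.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
print(config.id2label)
```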
---
## How to Use
### Use a pipeline as a high-level helper
```python
from transformers import pipeline
# Load the pipeline
model = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy='simple')
# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
# Get predictions
predictions = model(text)
print(predictions)
```
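With `aggregation_strategy='simple'`, the pipeline merges subword pieces and returns one dict per entity span with the keys `entity_group`, `score`, `word`, `start`, and `end`. A small post-processing sketch that keeps only confident spans (the 0.80 threshold is an illustrative assumption, not a value from the authors):

```python
# Keep only entity spans the model is reasonably confident about.
confident = [p for p in predictions if p["score"] >= 0.80]
for p in confident:
    print(f'{p["word"]} -> {p["entity_group"]} ({p["score"]:.2f})')
```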
### Load model directly
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Map each token's highest-scoring label id to its label name,
# skipping special tokens such as [CLS] and [SEP]
predictions = [
    (token, model.config.id2label[label_id.item()])
    for token, label_id in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        outputs.logits.argmax(dim=-1)[0],
    )
    if token not in tokenizer.all_special_tokens
]
print(predictions)
```
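The loop above yields one label per WordPiece token, so a single word (e.g. "düzenlemesi") may be split into several sub-tokens. A minimal sketch, assuming the default fast tokenizer, that labels each word by its first sub-token's prediction and recovers the surface form from character offsets:

```python
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    label_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

spans, labels = {}, {}
for idx, word_id in enumerate(enc.word_ids(0)):
    if word_id is None:
        continue  # special tokens like [CLS] / [SEP]
    start, end = offsets[idx]
    if word_id not in spans:
        spans[word_id] = [start, end]
        # label the word by its first sub-token's prediction
        labels[word_id] = model.config.id2label[label_ids[idx]]
    else:
        spans[word_id][1] = end  # extend span over later sub-tokens

for word_id, (start, end) in spans.items():
    print(text[start:end], "->", labels[word_id])
```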
---
## Authors
Farnaz Zeidi, Mehmet Fatih Amasyali, Çiğdem Erol
---
## License
This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors.
For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).