---
license: agpl-3.0
language:
- tr
base_model:
- deepset/gbert-base
pipeline_tag: token-classification
---

# NER Model for Legal Texts

Released in January 2024, this is a Turkish BERT language model pretrained from scratch with an **optimized BERT architecture** on a 2 GB Turkish legal corpus. The corpus was drawn from law-related theses available in the Higher Education Board National Thesis Center (YÖKTEZ). The model was then fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.

In our paper, we outline the training procedure and show that the model outperforms previous approaches.

---

## Overview

- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels**:
  - `Person`
  - `Law`
  - `Publication`
  - `Government`
  - `Corporation`
  - `Other`
  - `Project`
  - `Money`
  - `Date`
  - `Location`
  - `Court`

**Model Name**: LegalTurk Optimized BERT

---

## How to Use

### Use a pipeline as a high-level helper

```python
from transformers import pipeline

# Load the pipeline
model = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy='simple')

# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions
predictions = model(text)
print(predictions)
```
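
With `aggregation_strategy='simple'`, the pipeline returns a list of dictionaries, one per detected entity span, with keys such as `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one possible way to group that output by label. The helper name `group_entities_by_label`, the `min_score` threshold, and the sample predictions are illustrative assumptions; the sample is hand-written and is not actual output of this model.

```python
from collections import defaultdict


def group_entities_by_label(predictions, min_score=0.0):
    """Group aggregated NER pipeline output by entity label.

    `predictions` is expected to be a list of dicts with at least the
    keys `entity_group`, `word`, and `score`.
    """
    grouped = defaultdict(list)
    for entity in predictions:
        # Skip low-confidence spans; min_score=0.0 keeps everything.
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical predictions, hand-written for illustration only.
sample_predictions = [
    {"entity_group": "Law", "score": 0.98, "word": "Tebligat Kanunu", "start": 8, "end": 23},
    {"entity_group": "Law", "score": 0.91, "word": "VUK", "start": 28, "end": 31},
    {"entity_group": "Other", "score": 0.42, "word": "Burada", "start": 0, "end": 6},
]

print(group_entities_by_label(sample_predictions, min_score=0.5))
```

In practice, `sample_predictions` would be replaced by the `predictions` list returned by the pipeline call above, and a suitable `min_score` would need to be chosen empirically.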

### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring label per token and map it to its label name
# (argmax over the logits gives the same result as argmax over the softmax)
predicted_ids = outputs.logits.argmax(dim=-1)[0]
predictions = [
    (token, model.config.id2label[label_id.item()])
    for token, label_id in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        predicted_ids,
    )
    if token not in tokenizer.all_special_tokens
]

print(predictions)
```
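
The direct-loading example yields one label per tokenizer piece, not per word. Assuming the tokenizer follows the usual BERT WordPiece convention, where continuation pieces start with `##` (not confirmed here for this model), the sketch below merges pieces back into words and keeps the label of the first piece. The helper name `merge_wordpieces` and the sample pairs are hypothetical, and the label strings are placeholders; the actual tag names should be checked in `model.config.id2label`.

```python
def merge_wordpieces(token_label_pairs):
    """Merge '##'-prefixed continuation pieces into the preceding word.

    Each merged word keeps the label predicted for its first piece.
    """
    words = []
    for token, label in token_label_pairs:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            # First piece of a new word.
            words.append((token, label))
    return words


# Hypothetical (token, label) pairs, hand-written for illustration only.
sample_pairs = [
    ("Tebligat", "Law"),
    ("Kanun", "Law"),
    ("##u", "Law"),
    ("ile", "O"),
]

print(merge_wordpieces(sample_pairs))
```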

---

## Authors

Farnaz Zeidi, Mehmet Fatih Amasyali, Çigdem Erol

---

## License

This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors and share adaptations under the same license.

For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).