---
license: cc-by-nc-sa-4.0
language:
- tr
pipeline_tag: token-classification
---

# NER Model for Legal Texts

Released in January 2024, this is a Turkish BERT language model pretrained from scratch with an **optimized BERT architecture** on a 2 GB Turkish legal corpus. The corpus was sourced from law-related theses available in the Higher Education Board National Thesis Center (YÖKTEZ). The model was then fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.

In our paper, we describe the training procedure and show that the model outperforms previous approaches on legal-domain NER.

---

## Overview
- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels** (the raw tag scheme can be inspected with the snippet below): 
  - `Person`
  - `Law`
  - `Publication`
  - `Government`
  - `Corporation`
  - `Other`
  - `Project`
  - `Money`
  - `Date`
  - `Location`
  - `Court`

**Model Name**: LegalLTurk Optimized BERT
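
The tag scheme the checkpoint actually emits (plain labels vs. BIO-style `B-`/`I-` prefixes) is not spelled out here; a minimal sketch to inspect it, assuming the checkpoint ships the standard `id2label` mapping:

```python
from transformers import AutoConfig

# Print the raw tag inventory shipped with the checkpoint
config = AutoConfig.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
print(config.id2label)  # e.g. {0: 'O', 1: 'B-Person', ...} -- exact ids/tags depend on the checkpoint
```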

---

## How to Use

### Use a pipeline as a high-level helper
```python
from transformers import pipeline

# Load the pipeline
model = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy='simple')

# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions
predictions = model(text)
print(predictions)
```
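
With `aggregation_strategy='simple'`, the pipeline merges word pieces back into words, so each prediction is a dict with `entity_group`, `score`, `word`, `start`, and `end` rather than one entry per subword token.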


### Load model directly
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

# Map each non-special token to its predicted label
# (argmax over the logits; a softmax is not needed just to pick the top class)
predictions = [
    (token, model.config.id2label[label.item()])
    for token, label in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        outputs.logits.argmax(dim=-1)[0],
    )
    if token not in tokenizer.all_special_tokens
]

print(predictions)
```
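
The snippet above yields one tag per word piece. Assuming the checkpoint emits BIO-style tags (verify via `model.config.id2label`), a rough sketch for merging the token-level `predictions` into entity spans:

```python
def merge_bio(tagged_tokens):
    """Merge (word-piece, BIO-tag) pairs into (entity_text, label) spans.

    Assumes WordPiece '##' continuation markers and B-/I- prefixed tags;
    check model.config.id2label before relying on this.
    """
    spans = []
    for token, tag in tagged_tokens:
        word = token[2:] if token.startswith("##") else token
        if token.startswith("##") and spans:
            spans[-1][0] += word            # re-attach a subword piece
        elif tag.startswith("I-") and spans and spans[-1][1] == tag[2:]:
            spans[-1][0] += " " + word      # continue the current entity
        else:
            label = tag[2:] if tag[:2] in ("B-", "I-") else tag
            spans.append([word, label])     # start a new span (or an 'O' token)
    return [(text, label) for text, label in spans if label != "O"]

print(merge_bio(predictions))
```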
---
## Authors
Farnaz Zeidi, Mehmet Fatih Amasyalı, Çiğdem Erol

---

## License
This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). 
You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors.

For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).