---
language:
- tr
base_model:
- deepset/gbert-base
pipeline_tag: token-classification
---

# NER Model for Legal Texts
Released in January 2024, this is a Turkish BERT language model built on an **optimized BERT architecture** and pretrained from scratch on a 2 GB Turkish legal corpus. The corpus was sourced from legal-related thesis documents available in the Higher Education Board National Thesis Center (YÖKTEZ). The model has been fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.

In our paper, we outline the steps taken to train this model and demonstrate its superior performance compared to previous approaches.

---
## Overview

- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels**:
  - `Person`
  - `Law`
  - `Publication`
  - `Government`
  - `Corporation`
  - `Other`
  - `Project`
  - `Money`
  - `Date`
  - `Location`
  - `Court`

**Model Name**: LegalLTurk Optimized BERT
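
The label set stored in the checkpoint can be read directly from the model config, which is handy if you want to check how the labels listed above are encoded (for example, whether a `B-`/`I-` prefix scheme is used). A minimal sketch, assuming only the `transformers` library:

```python
from transformers import AutoConfig

# Inspect the id-to-label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```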
---

## How to Use

### Use a pipeline as a high-level helper

```python
from transformers import pipeline

# Load the pipeline
model = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy='simple')

# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions
predictions = model(text)
print(predictions)
```
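
With `aggregation_strategy='simple'`, the pipeline merges word pieces into entity spans, so each prediction is a dictionary with fields such as `entity_group`, `word`, `score`, `start`, and `end`. An optional sketch for printing the results in a more readable form:

```python
# Print one detected entity per line: label, matched text, confidence score
for entity in predictions:
    print(f"{entity['entity_group']:<12} {entity['word']:<25} {entity['score']:.3f}")
```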
### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Process logits and map predictions to labels
predictions = [
    (token, model.config.id2label[label.item()])
    for token, label in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        torch.argmax(torch.softmax(outputs.logits, dim=-1), dim=-1)[0],
    )
    if token not in tokenizer.all_special_tokens
]

print(predictions)
```
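
The snippet above yields one label per word piece, including `##` continuation pieces. If you prefer word-level predictions without the pipeline helper, one option is to group pieces by the tokenizer's word IDs. A minimal sketch that reuses `model`, `inputs`, `outputs`, `text`, and `torch` from the block above and assumes a fast tokenizer (so `word_ids()` is available):

```python
# Keep the prediction of each word's first piece and recover the word from the input text
labels = torch.argmax(outputs.logits, dim=-1)[0]
word_level, seen = [], set()
for idx, word_id in enumerate(inputs.word_ids(0)):
    if word_id is None or word_id in seen:  # skip special tokens and continuation pieces
        continue
    seen.add(word_id)
    span = inputs.word_to_chars(0, word_id)  # character span of this word in `text`
    word_level.append((text[span.start:span.end], model.config.id2label[labels[idx].item()]))

print(word_level)
```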
---

## Authors

Farnaz Zeidi, Mehmet Fatih Amasyali, Çigdem Erol

---
## License

This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors.

For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).