base_model:
- deepset/gbert-base
pipeline_tag: token-classification
---

# NER Model for Legal Texts

Released in January 2024, this is a Turkish BERT language model pretrained from scratch with an **optimized BERT architecture** on a 2 GB Turkish legal corpus. The corpus was sourced from legal-related thesis documents available in the Higher Education Board National Thesis Center (YÖKTEZ). The model has been fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.

In our paper, we describe the training procedure and demonstrate the model's superior performance compared with previous approaches.

---

## Overview
- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels**:
  - `Person`
  - `Law`
  - `Publication`
  - `Government`
  - `Corporation`
  - `Other`
  - `Project`
  - `Money`
  - `Date`
  - `Location`
  - `Court`

**Model Name**: LegalLTurk Optimized BERT

---

## How to Use

### Use a pipeline as a high-level helper
```python
from transformers import pipeline

# Load the NER pipeline; aggregation_strategy="simple" merges sub-word pieces into entity spans
ner = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy="simple")

# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions
predictions = ner(text)
print(predictions)
```
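
With `aggregation_strategy="simple"`, the pipeline returns one dictionary per entity span, with `entity_group`, `score`, `word`, `start`, and `end` keys. A common follow-up step is to drop low-confidence spans; here is a minimal sketch (the sample records and the `filter_by_confidence` helper are illustrative, not part of the model):

```python
# Illustrative pipeline-style output; real scores and spans come from running the model.
sample_predictions = [
    {"entity_group": "Law", "score": 0.98, "word": "Tebligat Kanunu", "start": 8, "end": 23},
    {"entity_group": "Law", "score": 0.41, "word": "VUK", "start": 28, "end": 31},
]

def filter_by_confidence(predictions, threshold=0.5):
    """Keep only entity spans whose aggregated score meets the threshold."""
    return [p for p in predictions if p["score"] >= threshold]

confident = filter_by_confidence(sample_predictions)
print([p["word"] for p in confident])  # only the high-confidence span remains
```

The right threshold depends on your application; lowering it trades precision for recall.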


### Load model directly
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map each non-special token to its predicted label
predictions = [
    (token, model.config.id2label[label_id.item()])
    for token, label_id in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        outputs.logits.argmax(dim=-1)[0],
    )
    if token not in tokenizer.all_special_tokens
]

print(predictions)
```
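
When loading the model directly, predictions are per WordPiece token, so a word split by the tokenizer appears as several pieces (continuation pieces carry a `##` prefix in BERT-style tokenizers). A minimal sketch of merging pieces back into whole words follows; the sample token/label pairs are illustrative placeholders, and each merged word simply keeps the label of its first piece:

```python
# Illustrative (token, label) pairs; real pairs come from the `predictions` list above.
sample = [("Tebligat", "Law"), ("Kanun", "Law"), ("##u", "Law"), ("ile", "O")]

def merge_wordpieces(token_label_pairs):
    """Join '##'-prefixed continuation pieces onto the preceding word."""
    words = []
    for token, label in token_label_pairs:
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label
            words[-1] = (words[-1][0] + token[2:], words[-1][1])
        else:
            words.append((token, label))
    return words

print(merge_wordpieces(sample))
# [('Tebligat', 'Law'), ('Kanunu', 'Law'), ('ile', 'O')]
```

For most applications the pipeline approach above is simpler, since `aggregation_strategy="simple"` performs this grouping for you.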
---

## Authors
Farnaz Zeidi, Mehmet Fatih Amasyali, Çiğdem Erol

---

## License
This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors.

For commercial use, please contact zeidi.uni@gmail.com.