amanpatkar committed
Commit dd58066 · verified · 1 Parent(s): da74d95

amanpatkar/distilbert-finetuned-ner

README.md CHANGED
@@ -1,130 +1,92 @@
- ---
- base_model: distilbert-base-cased
- datasets:
- - conll2003
- license: mit
- metrics:
- - precision
- - recall
- - f1
- - accuracy
- tags:
- - generated_from_trainer
- model-index:
- - name: distilbert-finetuned-ner
-   results:
-   - task:
-       type: token-classification
-       name: Token Classification
-     dataset:
-       name: conll2003
-       type: conll2003
-       config: conll2003
-       split: validation
-       args: conll2003
-     metrics:
-     - type: precision
-       value: 1
-       name: Precision
-     - type: recall
-       value: 1
-       name: Recall
-     - type: f1
-       value: 1
-       name: F1
-     - type: accuracy
-       value: 1
-       name: Accuracy
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # distilbert-finetuned-ner
-
- This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) on the conll2003 dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0711
- - Precision: 1.0
- - Recall: 1.0
- - F1: 1.0
- - Accuracy: 1.0
-
- ## Model description
-
- The distilbert-finetuned-ner model is designed for Named Entity Recognition (NER) tasks. It is based on the DistilBERT architecture, which is a smaller, faster, and lighter version of BERT. DistilBERT retains 97% of BERT's language understanding while being 60% faster and 40% smaller, making it efficient for deployment in production systems.
-
- ## Intended Uses & Limitations
-
- ### How to use
-
- You can use this model with the Transformers *pipeline* for NER.
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification
- from transformers import pipeline
-
- tokenizer = AutoTokenizer.from_pretrained("amanpatkar/distilbert-finetuned-ner")
- model = AutoModelForTokenClassification.from_pretrained("amanpatkar/distilbert-finetuned-ner")
-
- nlp = pipeline("ner", model=model, tokenizer=tokenizer)
- example = "My name is Aman Patkar and I live in Gurugram, India."
-
- ner_results = nlp(example)
- print(ner_results)
- ```
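By default this pipeline returns one prediction per subword token. Recent Transformers versions can merge those into whole entity spans via the `aggregation_strategy` argument; a minimal sketch (the repo id is taken from the card above, everything else is standard pipeline usage):

```python
from transformers import pipeline

# "simple" merges consecutive subword tokens that share an entity label
# into a single span with an aggregate score.
nlp = pipeline(
    "ner",
    model="amanpatkar/distilbert-finetuned-ner",
    aggregation_strategy="simple",
)
print(nlp("My name is Aman Patkar and I live in Gurugram, India."))
# Each item has entity_group, score, word, start, end.
```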
-
- ### Intended Uses
- - Named Entity Recognition (NER): Extracting entities such as names, locations, organizations, and miscellaneous entities from text.
- - Information Extraction: Automatically identifying and classifying key information in documents.
- - Text Preprocessing: Enhancing text preprocessing for downstream tasks like sentiment analysis and text summarization.
-
- ### Limitations
- - Domain Specificity: The model is trained on the CoNLL-2003 dataset, which consists primarily of newswire text. Performance may degrade on text from other domains.
- - Language Limitation: The model is trained on English text. It may not perform well on text in other languages.
- - Complex Sentences: While the model performs well on standard sentences, complex sentence structures or ambiguous contexts can still pose challenges.
-
-
- ## Training and evaluation data
-
- The model is fine-tuned on the CoNLL-2003 dataset, a widely used benchmark for training and evaluating NER systems. The dataset includes four types of named entities: Persons (PER), Organizations (ORG), Locations (LOC), and Miscellaneous (MISC).
-
- | Abbreviation | Description |
- |--------------|-------------|
- | O | Outside of a named entity |
- | B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
- | I-MISC | Miscellaneous entity |
- | B-PER | Beginning of a person’s name right after another person’s name |
- | I-PER | Person’s name |
- | B-ORG | Beginning of an organization right after another organization |
- | I-ORG | Organization |
- | B-LOC | Beginning of a location right after another location |
- | I-LOC | Location |
-
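A quick way to confirm which of these tags the checkpoint actually emits, and in what order, is to inspect its `id2label` mapping (a sanity check, not part of the original card):

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("amanpatkar/distilbert-finetuned-ner")
# Maps each output index to its tag string, e.g. {0: "O", 1: "B-PER", ...};
# the exact ordering comes from the checkpoint's config.json.
print(model.config.id2label)
```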
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - num_epochs: 3
-
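These settings correspond to a Hugging Face `TrainingArguments` along these lines (a minimal sketch; the `output_dir` name and the rest of the Trainer setup are assumptions, and the Adam betas/epsilon are the optimizer defaults):

```python
from transformers import TrainingArguments

# Mirrors the listed hyperparameters; everything not set here stays at its
# default, which already gives Adam betas=(0.9, 0.999), epsilon=1e-8.
args = TrainingArguments(
    output_dir="distilbert-finetuned-ner",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)
```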
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
- |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:---:|:--------:|
- | 0.0908 | 1.0 | 1756 | 0.0887 | 1.0 | 1.0 | 1.0 | 1.0 |
- | 0.0467 | 2.0 | 3512 | 0.0713 | 1.0 | 1.0 | 1.0 | 1.0 |
- | 0.0276 | 3.0 | 5268 | 0.0711 | 1.0 | 1.0 | 1.0 | 1.0 |
-
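Entity-level precision/recall/F1 for this task are typically computed with seqeval; a representative `compute_metrics` hook is sketched below (a standard pattern, not code extracted from this repo; the label order is the usual conll2003 one and should be verified against `model.config.id2label`):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
# Assumed conll2003 tag order.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Special tokens carry label -100 and are excluded from scoring.
    true_labels = [[label_list[l] for l in row if l != -100] for row in labels]
    true_preds = [
        [label_list[p] for p, l in zip(prow, lrow) if l != -100]
        for prow, lrow in zip(predictions, labels)
    ]
    res = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": res["overall_precision"],
        "recall": res["overall_recall"],
        "f1": res["overall_f1"],
        "accuracy": res["overall_accuracy"],
    }
```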
- ### Framework versions
-
- - Transformers 4.41.2
- - Pytorch 2.3.1
- - Datasets 2.20.0
- - Tokenizers 0.19.1
 
+ ---
+ license: apache-2.0
+ base_model: distilbert-base-cased
+ tags:
+ - generated_from_trainer
+ datasets:
+ - conll2003
+ metrics:
+ - precision
+ - recall
+ - f1
+ - accuracy
+ model-index:
+ - name: distilbert-finetuned-ner
+   results:
+   - task:
+       name: Token Classification
+       type: token-classification
+     dataset:
+       name: conll2003
+       type: conll2003
+       config: conll2003
+       split: validation
+       args: conll2003
+     metrics:
+     - name: Precision
+       type: precision
+       value: 1.0
+     - name: Recall
+       type: recall
+       value: 1.0
+     - name: F1
+       type: f1
+       value: 1.0
+     - name: Accuracy
+       type: accuracy
+       value: 1.0
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # distilbert-finetuned-ner
+
+ This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) on the conll2003 dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0736
+ - Precision: 1.0
+ - Recall: 1.0
+ - F1: 1.0
+ - Accuracy: 1.0
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 3
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:---:|:--------:|
+ | 0.0911 | 1.0 | 1756 | 0.0875 | 1.0 | 1.0 | 1.0 | 1.0 |
+ | 0.0469 | 2.0 | 3512 | 0.0736 | 1.0 | 1.0 | 1.0 | 1.0 |
+ | 0.0284 | 3.0 | 5268 | 0.0736 | 1.0 | 1.0 | 1.0 | 1.0 |
+
+ ### Framework versions
+
+ - Transformers 4.41.2
+ - Pytorch 2.3.1
+ - Datasets 2.20.0
+ - Tokenizers 0.19.1
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:c9ec6e7e3fbf78d8672ebd401b45e8f1191e89f8691f47f4ad5534abcc5f4933
+ oid sha256:a31a661dbb6ef9fdb0f2dab1d56359ee4a82513732017424ad61f20704c6629e
 size 260803668
special_tokens_map.json CHANGED
@@ -1,24 +1,7 @@
- {
-   "bos_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "</s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": "</s>",
-   "unk_token": {
-     "content": "<unk>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,41 +1,55 @@
- {
-   "add_bos_token": true,
-   "add_eos_token": false,
-   "added_tokens_decoder": {
-     "0": {
-       "content": "<unk>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "<s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "</s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<s>",
-   "clean_up_tokenization_spaces": false,
-   "eos_token": "</s>",
-   "legacy": false,
-   "model_max_length": 1000000000000000019884624838656,
-   "pad_token": "</s>",
-   "padding_side": "right",
-   "sp_model_kwargs": {},
-   "tokenizer_class": "LlamaTokenizer",
-   "unk_token": "<unk>",
-   "use_default_system_prompt": false
- }
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
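The old config described a LlamaTokenizer with `<s>`/`</s>` special tokens, which does not match a DistilBERT checkpoint; the new config restores the expected `DistilBertTokenizer`. A quick sanity check after this change (a sketch, assuming the repo is reachable):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("amanpatkar/distilbert-finetuned-ner")
# With the corrected config this should report a (fast) DistilBERT tokenizer
# using [CLS]/[SEP]/[PAD] rather than Llama-style <s>/</s> tokens.
print(type(tok).__name__)
print(tok.cls_token, tok.sep_token, tok.pad_token, tok.model_max_length)
```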
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:85d654163bf874c3aae09dc5148f58f0a480ae890d30bd869d0a1a024c4c43ff
+ oid sha256:3629620f90bdb934281b15a8b1171df6f5479ad09e884c16dd4b83669105f31a
 size 5112