---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
- generated_from_trainer
datasets:
- nkjp1m
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polish-roberta-base-v2-pos-tagging
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: nkjp1m
      type: nkjp1m
      config: nkjp1m
      split: test
      args: nkjp1m
    metrics:
    - name: Precision
      type: precision
      value: 0.9853198910270871
    - name: Recall
      type: recall
      value: 0.9858245297268206
    - name: F1
      type: f1
      value: 0.9855721457799069
    - name: Accuracy
      type: accuracy
      value: 0.9884294612942691
widget:
- text: "Niosę dwa miedziane leje"
- text: "Ale dzisiaj leje"
language:
- pl
---
# polish-roberta-base-v2-pos-tagging
This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0508
- Precision: 0.9853
- Recall: 0.9858
- F1: 0.9856
- Accuracy: 0.9884
You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning
## Usage
```python
from transformers import pipeline

nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-pos-tagging")
print(nlp("Ale dzisiaj leje"))
```
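The pipeline returns one prediction per subword token. If you want a single tag per word, the standard `aggregation_strategy` option of Transformers token-classification pipelines can group the pieces (a minimal sketch; the printed output is illustrative):

```python
from transformers import pipeline

# Group subword pieces so that each word receives a single tag.
nlp = pipeline(
    "token-classification",
    model="wkaminski/polish-roberta-base-v2-pos-tagging",
    aggregation_strategy="simple",
)
for pred in nlp("Ale dzisiaj leje"):
    print(pred["word"], pred["entity_group"], round(pred["score"], 3))
```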
## Model description
This model is a part-of-speech tagger for the Polish language based on [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2).
It supports 40 classes representing flexemic classes (detailed parts of speech):
```python
{
0: 'adj',
1: 'adja',
2: 'adjc',
3: 'adjp',
4: 'adv',
5: 'aglt',
6: 'bedzie',
7: 'brev',
8: 'comp',
9: 'conj',
10: 'depr',
11: 'dig',
12: 'fin',
13: 'frag',
14: 'ger',
15: 'imps',
16: 'impt',
17: 'inf',
18: 'interj',
19: 'interp',
20: 'num',
21: 'numcomp',
22: 'pact',
23: 'pacta',
24: 'pant',
25: 'part',
26: 'pcon',
27: 'ppas',
28: 'ppron12',
29: 'ppron3',
30: 'praet',
31: 'pred',
32: 'prep',
33: 'romandig',
34: 'siebie',
35: 'subst',
36: 'sym',
37: 'winien',
38: 'xxs',
39: 'xxx'
}
```
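The same mapping ships with the model checkpoint, so there is no need to hard-code it; it can be read from the published config (a minimal sketch using the standard `AutoConfig` API):

```python
from transformers import AutoConfig

# id2label is part of the published model configuration.
config = AutoConfig.from_pretrained("wkaminski/polish-roberta-base-v2-pos-tagging")
print(config.id2label[35])  # 'subst' (noun), per the mapping above
```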
The tags have the same meaning as in the nkjp1m dataset:
| flexeme | abbreviation | base form | example |
|----------------------------|--------------|---------------------------------------------------------|-----------------------|
| noun | subst | singular nominative | profesor |
| depreciative form | depr | singular nominative form of the corresponding noun | profesor |
| main numeral | num | inanimate masculine nominative form | pięć, dwa |
| collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
| adjective | adj | singular nominative masculine positive form | polski |
| ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
| post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
| predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
| adverb | adv | positive form | dobrze, bardzo |
| non-3rd person pronoun | ppron12 | singular nominative | ja |
| 3rd-person pronoun | ppron3 | singular nominative | on |
| pronoun siebie | siebie | accusative | siebie |
| non-past form | fin | infinitive | czytać |
| future być | bedzie | infinitive | być |
| agglutinate być | aglt | infinitive | być |
| l-participle | praet | infinitive | czytać |
| imperative | impt | infinitive | czytać |
| impersonal | imps | infinitive | czytać |
| infinitive | inf | infinitive | czytać |
| contemporary adv. participle | pcon | infinitive | czytać |
| anterior adv. participle | pant | infinitive | czytać |
| gerund | ger | infinitive | czytać |
| active adj. participle | pact | infinitive | czytać |
| passive adj. participle | ppas | infinitive | czytać |
| winien | winien | singular masculine form | powinien, rad |
| predicative | pred | the only form of that flexeme | warto |
| preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
| coordinating conjunction | conj | the only form of that flexeme | oraz |
| subordinating conjunction | comp | the only form of that flexeme | że |
| particle-adverb | qub | the only form of that flexeme | nie, -że, się |
| abbreviation | brev | the full dictionary form | rok, i tak dalej |
| bound word | burk | the only form of that flexeme | trochu, oścież |
| interjection | interj | the only form of that flexeme | ech, kurde |
| punctuation | interp | the only form of that flexeme | ;, ., (, ] |
| alien | xxx | the only form of that flexeme | cool, nihil |
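
The two widget sentences in the metadata illustrate the disambiguation this tagset requires: in "Niosę dwa miedziane leje" ("I am carrying two copper funnels") the word *leje* is a noun, while in "Ale dzisiaj leje" ("It is really pouring today") it is a finite verb. A quick comparison (illustrative sketch; the comment shows the expected tags, not verified output):

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="wkaminski/polish-roberta-base-v2-pos-tagging",
    aggregation_strategy="simple",
)

for sentence in ["Niosę dwa miedziane leje", "Ale dzisiaj leje"]:
    tags = [(p["word"], p["entity_group"]) for p in nlp(sentence)]
    print(sentence, "->", tags)
# Expected: 'leje' tagged as subst in the first sentence and fin in the second.
```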
## Intended uses & limitations
Although good POS-tagging tools for Polish already exist (e.g. Morfeusz, http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face supports such functionality, which is why I created this model.
## Training and evaluation data
The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
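The dataset id from the metadata above can be loaded with the `datasets` library (a sketch, assuming the `nkjp1m` id resolves on the Hugging Face Hub):

```python
from datasets import load_dataset

# Assumes the dataset id used in this card's metadata is available on the Hub.
dataset = load_dataset("nkjp1m")
print(dataset)
```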
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
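For reference, these settings correspond roughly to the following `TrainingArguments`. This is a hedged reconstruction (the `output_dir` is hypothetical); the authoritative configuration is in the linked notebook:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration listed above;
# see the linked training notebook for the authoritative version.
training_args = TrainingArguments(
    output_dir="polish-roberta-base-v2-pos-tagging",  # hypothetical
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)
```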
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0665 | 1.0 | 2155 | 0.0629 | 0.9835 | 0.9836 | 0.9836 | 0.9867 |
| 0.0369 | 2.0 | 4310 | 0.0539 | 0.9842 | 0.9848 | 0.9845 | 0.9876 |
| 0.0243 | 3.0 | 6465 | 0.0508 | 0.9853 | 0.9858 | 0.9856 | 0.9884 |
### Framework versions
- Transformers 4.36.0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0