---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
- generated_from_trainer
datasets:
- nkjp1m
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polish-roberta-base-v2-pos-tagging
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: nkjp1m
      type: nkjp1m
      config: nkjp1m
      split: test
      args: nkjp1m
    metrics:
    - name: Precision
      type: precision
      value: 0.9853198910270871
    - name: Recall
      type: recall
      value: 0.9858245297268206
    - name: F1
      type: f1
      value: 0.9855721457799069
    - name: Accuracy
      type: accuracy
      value: 0.9884294612942691
widget:
- text: "Niosę dwa miedziane leje"
- text: "Ale dzisiaj leje"
language:
- pl
---
# polish-roberta-base-v2-pos-tagging
This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0508
- Precision: 0.9853
- Recall: 0.9858
- F1: 0.9856
- Accuracy: 0.9884
You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning
## Usage
```python
from transformers import pipeline

nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-pos-tagging")
print(nlp("Ale dzisiaj leje"))
```
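The pipeline returns one prediction per subword token. If you want a single tag per word, the standard `aggregation_strategy` option of Transformers token-classification pipelines can group the pieces (a minimal sketch; the printed output is illustrative):

```python
from transformers import pipeline

# Group subword pieces so that each word receives a single tag.
nlp = pipeline(
    "token-classification",
    model="wkaminski/polish-roberta-base-v2-pos-tagging",
    aggregation_strategy="simple",
)
for pred in nlp("Ale dzisiaj leje"):
    print(pred["word"], pred["entity_group"], round(pred["score"], 3))
```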
## Model description
This model is a part-of-speech tagger for the Polish language based on [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2).
It supports 40 classes representing flexemic classes (detailed parts of speech):
```python
{
0: 'adj',
1: 'adja',
2: 'adjc',
3: 'adjp',
4: 'adv',
5: 'aglt',
6: 'bedzie',
7: 'brev',
8: 'comp',
9: 'conj',
10: 'depr',
11: 'dig',
12: 'fin',
13: 'frag',
14: 'ger',
15: 'imps',
16: 'impt',
17: 'inf',
18: 'interj',
19: 'interp',
20: 'num',
21: 'numcomp',
22: 'pact',
23: 'pacta',
24: 'pant',
25: 'part',
26: 'pcon',
27: 'ppas',
28: 'ppron12',
29: 'ppron3',
30: 'praet',
31: 'pred',
32: 'prep',
33: 'romandig',
34: 'siebie',
35: 'subst',
36: 'sym',
37: 'winien',
38: 'xxs',
39: 'xxx'
}
```
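The same mapping ships with the model checkpoint, so there is no need to hard-code it; it can be read from the published config (a minimal sketch using the standard `AutoConfig` API):

```python
from transformers import AutoConfig

# id2label is part of the published model configuration.
config = AutoConfig.from_pretrained("wkaminski/polish-roberta-base-v2-pos-tagging")
print(config.id2label[35])  # 'subst' (noun), per the mapping above
```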
The tags have the same meaning as in the nkjp1m dataset:
| flexeme | abbreviation | base form | example |
|----------------------------|--------------|---------------------------------------------------------|-----------------------|
| noun | subst | singular nominative | profesor |
| depreciative form | depr | singular nominative form of the corresponding noun | profesor |
| main numeral | num | inanimate masculine nominative form | pięć, dwa |
| collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
| adjective | adj | singular nominative masculine positive form | polski |
| ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
| post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
| predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
| adverb | adv | positive form | dobrze, bardzo |
| non-3rd person pronoun | ppron12 | singular nominative | ja |
| 3rd-person pronoun | ppron3 | singular nominative | on |
| pronoun siebie | siebie | accusative | siebie |
| non-past form | fin | infinitive | czytać |
| future być | bedzie | infinitive | być |
| agglutinate być | aglt | infinitive | być |
| l-participle | praet | infinitive | czytać |
| imperative | impt | infinitive | czytać |
| impersonal | imps | infinitive | czytać |
| infinitive | inf | infinitive | czytać |
| contemporary adv. participle | pcon | infinitive | czytać |
| anterior adv. participle | pant | infinitive | czytać |
| gerund | ger | infinitive | czytać |
| active adj. participle | pact | infinitive | czytać |
| passive adj. participle | ppas | infinitive | czytać |
| winien | winien | singular masculine form | powinien, rad |
| predicative | pred | the only form of that flexeme | warto |
| preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
| coordinating conjunction | conj | the only form of that flexeme | oraz |
| subordinating conjunction | comp | the only form of that flexeme | że |
| particle-adverb | qub | the only form of that flexeme | nie, -że, się |
| abbreviation | brev | the full dictionary form | rok, i tak dalej |
| bound word | burk | the only form of that flexeme | trochu, oścież |
| interjection | interj | the only form of that flexeme | ech, kurde |
| punctuation | interp | the only form of that flexeme | ;, ., (, ] |
| alien | xxx | the only form of that flexeme | cool, nihil |
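
The two widget sentences in the metadata illustrate the disambiguation this tagset requires: in "Niosę dwa miedziane leje" ("I am carrying two copper funnels") the word *leje* is a noun, while in "Ale dzisiaj leje" ("It is really pouring today") it is a finite verb. A quick comparison (illustrative sketch; the comment shows the expected tags, not verified output):

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="wkaminski/polish-roberta-base-v2-pos-tagging",
    aggregation_strategy="simple",
)

for sentence in ["Niosę dwa miedziane leje", "Ale dzisiaj leje"]:
    tags = [(p["word"], p["entity_group"]) for p in nlp(sentence)]
    print(sentence, "->", tags)
# Expected: 'leje' tagged as subst in the first sentence and fin in the second.
```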
## Intended uses & limitations
Although good POS-tagging tools for Polish already exist (e.g. Morfeusz, http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face supports such functionality, which is why I created this model.
## Training and evaluation data
The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
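The dataset id from the metadata above can be loaded with the `datasets` library (a sketch, assuming the `nkjp1m` id resolves on the Hugging Face Hub):

```python
from datasets import load_dataset

# Assumes the dataset id used in this card's metadata is available on the Hub.
dataset = load_dataset("nkjp1m")
print(dataset)
```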
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
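For reference, these settings correspond roughly to the following `TrainingArguments`. This is a hedged reconstruction (the `output_dir` is hypothetical); the authoritative configuration is in the linked notebook:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration listed above;
# see the linked training notebook for the authoritative version.
training_args = TrainingArguments(
    output_dir="polish-roberta-base-v2-pos-tagging",  # hypothetical
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)
```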
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0665 | 1.0 | 2155 | 0.0629 | 0.9835 | 0.9836 | 0.9836 | 0.9867 |
| 0.0369 | 2.0 | 4310 | 0.0539 | 0.9842 | 0.9848 | 0.9845 | 0.9876 |
| 0.0243 | 3.0 | 6465 | 0.0508 | 0.9853 | 0.9858 | 0.9856 | 0.9884 |
### Framework versions
- Transformers 4.36.0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0