File size: 3,174 Bytes
b98f371 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 |
---
language: pl
tags:
- fastText
datasets:
- kgr10
---
# KGR10 FastText Polish word embeddings
Distributional language model (both textual and binary) for Polish (word embeddings) trained on KGR10 corpus (over 4 billion of words) using Fasttext with the following variants (all possible combinations):
- dimension: 100, 300
- method: skipgram, cbow
- tool: FastText, Magnitude
- source text: plain, plain.lower, plain.lemma, plain.lemma.lower
## Models
In the repository you can find 4 selected models, that were examined in the paper (see Citation).
A model that performed the best is the default model/config (see `default_config.json`).
## Usage
To use these embedding models easily, it is required to install [embeddings](https://github.com/CLARIN-PL/embeddings).
```bash
pip install clarinpl-embeddings
```
### Utilising the default model (the easiest way)
Word embedding:
```python
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])
for token in sentence:
print(token)
print(token.embedding)
```
Document embedding (averaged over words):
```python
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])
print(sentence.embedding)
```
### Customisable way
Word embedding:
```python
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence
config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticWordEmbedding.from_config(config)
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])
for token in sentence:
print(token)
print(token.embedding)
```
Document embedding (averaged over words):
```python
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence
config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticDocumentEmbedding.from_config(config)
sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])
print(sentence.embedding)
```
## Citation
The link below leads to the NextCloud directory with all variants of embeddings. If you use it, please cite the following article:
```
@article{kocon2018embeddings,
author = {Koco\'{n}, Jan and Gawor, Micha{\l}},
title = {Evaluating {KGR10} {P}olish word embeddings in the recognition of temporal
expressions using {BiLSTM-CRF}},
journal = {Schedae Informaticae},
volume = {27},
year = {2018},
url = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/},
doi = {10.4467/20838476SI.18.008.10413}
}
```
|