system HF staff committed on
Commit
b98f371
0 Parent(s):

add models

.gitattributes ADDED
@@ -0,0 +1,17 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
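
The patterns added to `.gitattributes` decide which files in this commit are stored through Git LFS. The following is a minimal sketch, not Git's own matcher: it approximates the matching with Python's `fnmatch` applied to the file's basename, which is an assumption that holds for simple patterns like these but differs from Git in edge cases. The helper name `is_lfs_tracked` is hypothetical; `git check-attr filter -- <path>` remains the authoritative check.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the .gitattributes hunk in this commit.
LFS_PATTERNS = [
    "*.bin.*", "*.lfs.*", "*.bin", "*.h5", "*.tflite", "*.tar.gz",
    "*.ot", "*.onnx", "*.arrow", "*.ftz", "*.joblib", "*.model",
    "*.msgpack", "*.pb", "*.pt", "*.pth", "*tfevents*",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: match the basename against each pattern.

    Assumption: patterns without a slash are compared to the basename,
    and fnmatch-style globbing is close enough for these patterns.
    """
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


# Files added in this commit.
for path in [
    "kgr10.plain.cbow.dim100.neg10.bin",
    "test/dummy.model.bin",
    "README.md",
    "default_config.json",
    "module.json",
]:
    storage = "LFS" if is_lfs_tracked(path) else "plain Git"
    print(f"{path}: {storage}")
```

Under this approximation the `.bin` files resolve to LFS, which agrees with the pointer files shown for them in this commit, while the README and JSON files stay in plain Git.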
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ language: pl
+ tags:
+ - fastText
+ datasets:
+ - kgr10
+ ---
+
+ # KGR10 FastText Polish word embeddings
+
+ Distributional language model (both textual and binary) for Polish (word embeddings) trained on the KGR10 corpus (over 4 billion words) using FastText, with the following variants (all possible combinations):
+ - dimension: 100, 300
+ - method: skipgram, cbow
+ - tool: FastText, Magnitude
+ - source text: plain, plain.lower, plain.lemma, plain.lemma.lower
+
+ ## Models
+
+ The repository contains 4 selected models that were examined in the paper (see Citation).
+ The best-performing model is the default model/config (see `default_config.json`).
+
+ ## Usage
+
+ To use these embedding models easily, install the [embeddings](https://github.com/CLARIN-PL/embeddings) library.
+
+ ```bash
+ pip install clarinpl-embeddings
+ ```
+
+ ### Utilising the default model (the easiest way)
+
+ Word embedding:
+
+ ```python
+ from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
+ from flair.data import Sentence
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+
+ embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
+ embedding.embed([sentence])
+
+ for token in sentence:
+     print(token)
+     print(token.embedding)
+ ```
+
+ Document embedding (averaged over words):
+
+ ```python
+ from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
+ from flair.data import Sentence
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+
+ embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10")
+ embedding.embed([sentence])
+
+ print(sentence.embedding)
+ ```
+
+ ### Customisable way
+
+ Word embedding:
+
+ ```python
+ from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
+ from embeddings.embedding.static.fasttext import KGR10FastTextConfig
+ from flair.data import Sentence
+
+ config = KGR10FastTextConfig(method='cbow', dimension=100)
+ embedding = AutoStaticWordEmbedding.from_config(config)
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+ embedding.embed([sentence])
+
+ for token in sentence:
+     print(token)
+     print(token.embedding)
+ ```
+
+ Document embedding (averaged over words):
+
+ ```python
+ from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
+ from embeddings.embedding.static.fasttext import KGR10FastTextConfig
+ from flair.data import Sentence
+
+ config = KGR10FastTextConfig(method='cbow', dimension=100)
+ embedding = AutoStaticDocumentEmbedding.from_config(config)
+
+ sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
+ embedding.embed([sentence])
+
+ print(sentence.embedding)
+ ```
+
+
+ ## Citation
+
+ The link below leads to the NextCloud directory with all variants of embeddings. If you use it, please cite the following article:
+
+ ```
+ @article{kocon2018embeddings,
+ author = {Koco\'{n}, Jan and Gawor, Micha{\l}},
+ title = {Evaluating {KGR10} {P}olish word embeddings in the recognition of temporal
+ expressions using {BiLSTM-CRF}},
+ journal = {Schedae Informaticae},
+ volume = {27},
+ year = {2018},
+ url = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/},
+ doi = {10.4467/20838476SI.18.008.10413}
+ }
+ ```
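
The README describes the available embeddings as all possible combinations of four options. As a small illustration of what that implies, the sketch below enumerates the combinations with `itertools.product`. The option values are taken from the README; the variable names are illustrative only and are not part of the `embeddings` library.

```python
from itertools import product

# Option values as listed in the README.
DIMENSIONS = (100, 300)
METHODS = ("skipgram", "cbow")
TOOLS = ("FastText", "Magnitude")
SOURCE_TEXTS = ("plain", "plain.lower", "plain.lemma", "plain.lemma.lower")

# Every (dimension, method, tool, source text) combination.
variants = list(product(DIMENSIONS, METHODS, TOOLS, SOURCE_TEXTS))

# 2 dimensions * 2 methods * 2 tools * 4 source texts
print(len(variants))

# Show only the FastText variants trained on plain text.
for dimension, method, tool, source in variants:
    if tool == "FastText" and source == "plain":
        print(dimension, method, tool, source)
```

The filter at the end selects the four FastText models on plain text, which is the same dimension and method grid as the four `.bin` files added in this commit.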
default_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "method": "skipgram",
+   "dimension": 300
+ }
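
To make the role of `default_config.json` concrete, here is a minimal sketch that parses the same JSON, validates it against the option values listed in the README, and derives a model file name. The file-name pattern is an assumption inferred from the four `.bin` files in this commit, and `load_config` and `model_filename` are hypothetical helpers, not functions of the `embeddings` library.

```python
import json

# Contents of default_config.json from this commit.
DEFAULT_CONFIG_JSON = '{"method": "skipgram", "dimension": 300}'

# Allowed values as listed in the README.
ALLOWED_METHODS = {"skipgram", "cbow"}
ALLOWED_DIMENSIONS = {100, 300}


def load_config(text: str) -> dict:
    """Parse a config and reject values the README does not list."""
    config = json.loads(text)
    if config["method"] not in ALLOWED_METHODS:
        raise ValueError(f"unknown method: {config['method']!r}")
    if config["dimension"] not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {config['dimension']!r}")
    return config


def model_filename(config: dict) -> str:
    """Assumed naming scheme, inferred from the files in this commit."""
    return f"kgr10.plain.{config['method']}.dim{config['dimension']}.neg10.bin"


config = load_config(DEFAULT_CONFIG_JSON)
print(model_filename(config))
```

With the default values, the derived name coincides with one of the four model files added below.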
kgr10.plain.cbow.dim100.neg10.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43a99e3ba9f91e50d82d1fa30d78fa2b8663bae2571a8ceb60f1b86c9fe587c4
+ size 3657638661
kgr10.plain.cbow.dim300.neg10.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1dfe6ad69103ce48d2b5f24d4ad3ba6771b8a892ea2b01f0847fc367d64add6
+ size 10839393861
kgr10.plain.skipgram.dim100.neg10.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:56a7b5bb1eb817ccf6dc229988913af9cfb6fd1ca1d8ae331fd7e07d7a0e2c62
+ size 3657638661
kgr10.plain.skipgram.dim300.neg10.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:664fdeba47a5aa694edc2360ad89deeb069c7ef1875d7d3782205c37a2dcc072
+ size 10839393861
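
The `.bin` entries above are Git LFS pointer files: three `key value` lines giving the spec version, a SHA-256 object ID, and the size in bytes of the real file. The sketch below shows one way to parse such a pointer and check downloaded data against it. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for illustration. For multi-gigabyte models like these, a real check would hash the file in chunks rather than load it into memory.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """True if the data has the size and digest recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


# Pointer for kgr10.plain.skipgram.dim300.neg10.bin, copied from this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:664fdeba47a5aa694edc2360ad89deeb069c7ef1875d7d3782205c37a2dcc072
size 10839393861
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])

# Self-check on a tiny payload, since the real model is not available here.
payload = b"hello"
demo_pointer = {
    "version": pointer["version"],
    "algo": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(matches_pointer(payload, demo_pointer))
```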
module.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "type": "embeddings.embedding.static.fasttext.KGR10FastTextEmbedding"
+ }
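
`module.json` stores a fully qualified class name as a dotted path. How the `embeddings` library consumes this file is not shown in the commit, so the following is only a sketch of the common pattern for turning such a path into a class with `importlib`. `resolve_dotted_path` is a hypothetical helper, and a standard-library class stands in for the real one so that the example runs without the library installed.

```python
import importlib
import json


def resolve_dotted_path(path: str):
    """Import the module part of the path and return the named attribute."""
    module_name, _, attribute = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Contents of module.json from this commit.
MODULE_JSON = '{"type": "embeddings.embedding.static.fasttext.KGR10FastTextEmbedding"}'

type_path = json.loads(MODULE_JSON)["type"]
module_name, _, class_name = type_path.rpartition(".")
print(module_name)
print(class_name)

# Stand-in demonstration with a standard-library class.
print(resolve_dotted_path("collections.OrderedDict"))
```

With `clarinpl-embeddings` installed, passing `type_path` to the same helper would be expected to return the embedding class, provided that module path exists in the installed version.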
test/dummy.model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c49c8b8e85de626b44d1c170b6d66f7d7e6fdf0b692039959e9e7e97b5cad2e9
+ size 90705