---
license: apache-2.0
tags:
- word2vec
- gensim
datasets:
- wikipedia
language:
- it
---
--------------------------------------------------------------------------------------------------

<body>
<span class="vertical-text" style="background-color:lightgreen;border-radius: 3px;padding: 3px;"></span>
<br>
<span class="vertical-text" style="background-color:orange;border-radius: 3px;padding: 3px;">  </span>
<br>
<span class="vertical-text" style="background-color:lightblue;border-radius: 3px;padding: 3px;">    Model: Word2Vec</span>
<br>
<span class="vertical-text" style="background-color:tomato;border-radius: 3px;padding: 3px;">    Lang: IT</span>
<br>
<span class="vertical-text" style="background-color:lightgrey;border-radius: 3px;padding: 3px;">  </span>
<br>
<span class="vertical-text" style="background-color:#CF9FFF;border-radius: 3px;padding: 3px;"></span>
</body>

--------------------------------------------------------------------------------------------------

<h3>Model description</h3>

This model is a <b>lightweight</b> and uncased version of <b>Word2Vec</b> <b>[1]</b> for the <b>Italian</b> language. It is implemented in Gensim and provides embeddings for 560,509 uncased Italian words in a 100-dimensional vector space, resulting in a total model size of about 245 MB.


<h3>Training procedure</h3>

The model was trained on the Italian split of the Wikipedia dataset (about 3.7 GB of lowercased, pre-processed text) for 10 epochs, using a window size of 5, a minimum word count of 10, and an initial learning rate of 2.5e-3.
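
For reference, a comparable model can be reproduced with Gensim along the following lines. This is a minimal sketch, not the original training script: the corpus file name, iterator, and worker count are illustrative assumptions, while the hyperparameters match the values stated above.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assumption: a pre-processed (lowercased, tokenized) dump of the
# Italian Wikipedia, one sentence per line.
corpus = LineSentence("./itwiki_preprocessed.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=10,     # discard words occurring fewer than 10 times
    epochs=10,        # passes over the corpus
    alpha=2.5e-3,     # initial learning rate
    workers=4,        # illustrative; set according to your CPU
)

# Save only the word vectors, in the format loaded in "Quick usage" below.
model.wv.save("./word2vec-light-uncased-it/word2vec.wordvectors")
```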


<h3>Quick usage</h3>

Download the files into a local folder called "word2vec-light-uncased-it", then run:

```python
from gensim.models import KeyedVectors

# Load the word vectors in read-only, memory-mapped mode.
model = KeyedVectors.load("./word2vec-light-uncased-it/word2vec.wordvectors", mmap='r')

# Top 5 most similar words to "poesia" ("poetry"), by cosine similarity.
print(model.most_similar("poesia", topn=5))
```

Expected output:

```
[('letteratura', 0.8193784356117249),
 ('poetica', 0.8115736246109009),
 ('narrativa', 0.7729100584983826),
 ('drammaturgia', 0.7576397061347961),
 ('prosa', 0.7552034854888916)]
```
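
Beyond nearest-neighbour queries, the loaded KeyedVectors object supports the usual Gensim operations. The snippet below is a small illustrative example (the query words are arbitrary and the exact scores depend on the released vectors):

```python
# Cosine similarity between two words.
print(model.similarity("poesia", "prosa"))

# Raw embedding for a word: a 100-dimensional numpy array.
print(model["poesia"].shape)  # (100,)

# Vocabulary size (should match the 560,509 words stated above).
print(len(model.key_to_index))
```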

<h3>Limitations</h3>

This lightweight model is trained on Wikipedia, so it is particularly suitable for natively digital text
from the web, written in a correct and fluent form (wikis, web pages, news articles, etc.).

However, it may show limitations on noisy text containing errors and slang expressions
(such as social media posts), and on domain-specific text (such as medical, financial, or legal content).

<h3>References</h3>

[1] Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781

<h3>License</h3>

The model is released under the <b>Apache-2.0</b> license.