---
language:
- fr
tags:
- music
- rap
- lyrics
- word2vec
library_name: gensim
---

# Word2Bezbar: Word2Vec Models for French Rap Lyrics

## Overview

__Word2Bezbar__ are __Word2Vec__ models trained on __french rap lyrics__ sourced from __Genius__. Tokenization has been done using __NLTK__ french `word_tokenze` function, with a prior processing to remove __french oral contractions__. Used dataset size was __323MB__, corresponding to __77M tokens__. |
|
|
|
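To illustrate this preprocessing, the sketch below expands a few oral contractions before tokenizing with NLTK's French `word_tokenize`. The contraction list and the lowercasing are illustrative assumptions, not the exact pipeline used for training.

```python
import re
from nltk.tokenize import word_tokenize  # requires the "punkt" tokenizer data

# Illustrative (non-exhaustive) mapping of French oral contractions to full forms
CONTRACTIONS = {
    r"\by'a\b": "il y a",
    r"\bt'as\b": "tu as",
    r"\bj'suis\b": "je suis",
}

def preprocess(line):
    line = line.lower()
    for pattern, full_form in CONTRACTIONS.items():
        line = re.sub(pattern, full_form, line)
    return word_tokenize(line, language="french")

preprocess("Y'a que la moula qui compte")
# ['il', 'y', 'a', 'que', 'la', 'moula', 'qui', 'compte']
```
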
The model captures the __semantic relationships__ between words in the context of __French rap__, providing a useful tool for studies of __French slang__ and for __music lyrics analysis__.

## Model Details

This model is the __medium__ variant.

| Parameter      | Value |
|----------------|-------|
| Dimensionality | 200   |
| Window Size    | 10    |
| Epochs         | 20    |
| Algorithm      | CBOW  |

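For reference, a model with these hyperparameters can be trained with Gensim roughly as follows. This is a minimal sketch: the tokenized corpus and the `workers` value are assumptions, not the exact training script.

```python
from gensim.models import Word2Vec

# `sentences` must be an iterable of token lists, e.g. the tokenized lyrics
# produced by the preprocessing step described in the Overview (assumed here).
sentences = [["la", "moula", "avant", "tout"], ["que", "du", "biff", "dans", "le", "bendo"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=200,  # Dimensionality
    window=10,        # Window Size
    epochs=20,        # Epochs
    sg=0,             # CBOW algorithm
    workers=4,        # assumption: not specified on this card
)
model.save("word2vec.model")
```
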
## Versions

This model was trained with the following software versions:

| Requirement    | Version |
|----------------|---------|
| Python         | 3.8.5   |
| Gensim library | 4.3.2   |
| NLTK library   | 3.8.1   |

## Installation

1. **Install Required Python Libraries**:

```bash
pip install gensim
```

2. **Clone the Repository**:

```bash
git clone https://github.com/rapminerz/Word2Bezbar-medium.git
```

3. **Navigate to the Model Directory**:

```bash
cd Word2Bezbar-medium
```

## Loading the Model

To load the Word2Bezbar Word2Vec model, use the following Python code:

```python
import gensim

# Load the Word2Vec model
model = gensim.models.Word2Vec.load("word2vec.model")
```

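If you only need word lookups rather than further training, the vectors can also be handled through Gensim's lighter `KeyedVectors` interface. A brief sketch, assuming the model loaded above (the `.kv` filename is arbitrary):

```python
from gensim.models import KeyedVectors

# Extract and save only the word vectors from the full model
model.wv.save("word2bezbar.kv")

# Later, reload the vectors without the training state
word_vectors = KeyedVectors.load("word2bezbar.kv")
word_vectors.most_similar("bendo")
```
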
## Using the Model

Once the model is loaded, you can use it as shown below:

1. **To get the words most similar to a given word**

```python
model.wv.most_similar("bendo")
[('binks', 0.7833775877952576),
 ('bando', 0.7511972188949585),
 ('tieks', 0.7123318910598755),
 ('ghetto', 0.6887569427490234),
 ('hall', 0.679759681224823),
 ('barrio', 0.6694452166557312),
 ('hood', 0.6490002274513245),
 ('block', 0.6299082040786743),
 ('bloc', 0.627208411693573),
 ('secteur', 0.6225507855415344)]

model.wv.most_similar("kichta")
[('liasse', 0.7877408266067505),
 ('sse-lia', 0.7605615854263306),
 ('kishta', 0.7043415904045105),
 ('kich', 0.663270890712738),
 ('sacoche', 0.6381840705871582),
 ('moula', 0.6318666338920593),
 ('valise', 0.5628494024276733),
 ('bonbonne', 0.55326247215271),
 ('skalape', 0.5523083806037903),
 ('kichtas', 0.5385912656784058)]
```

2. **To find the word that doesn't match in a list of words**

```python
model.wv.doesnt_match(["racli","gow","gadji","fimbi","boug"])
'boug'

model.wv.doesnt_match(["Zidane","Mbappé","Ronaldo","Messi","Jordan"])
'Jordan'
```

3. **To find the similarity between two words**

```python
model.wv.similarity("kichta", "moula")
0.63186663

model.wv.similarity("bonheur", "moula")
0.14551902
```

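For reference, `similarity` is the cosine similarity between the two word vectors, which you can cross-check with NumPy:

```python
import numpy as np

v1, v2 = model.wv["kichta"], model.wv["moula"]
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
# cosine is approximately 0.6319, matching model.wv.similarity("kichta", "moula") above
```
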
4. **Or even get the vector representation of a word**

```python
model.wv['ekip']
array([ 1.4757039e-01, ... 1.1260221e+00],
      dtype=float32)
```

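Note that vectors only exist for words seen during training: looking up an out-of-vocabulary word raises a `KeyError`. A small sketch of a guarded lookup (the example word is arbitrary):

```python
word = "bendo"
if word in model.wv:  # membership test against the model vocabulary
    vector = model.wv[word]  # 200-dimensional numpy array, per the table above
else:
    print(f"'{word}' is not in the model vocabulary")
```
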
## Purpose and Disclaimer

This model is designed for academic and research purposes only. It is not intended for commercial use. The creators of this model do not endorse or promote any specific views or opinions that may be represented in the dataset.

__Please mention @RapMinerz if you use our models__

## Contact

For any questions or issues, please contact the repository owner, __RapMinerz__, at rapminerz.contact@gmail.com.