---
license: cc-by-4.0
---

# Clean ConceptNet Data for All Languages

## Data Details

For our project on [Retrofitting GloVe embeddings for Low-Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction involved several cleaning and analysis steps applied to the official ConceptNet 5.7.0 assertions dump, available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).
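
The individual cleaning steps are not listed here, so the following is only a minimal sketch of one plausible first step: reducing the assertions dump to the edges of a single language. It assumes the dump's tab-separated layout (edge URI, relation, start node, end node, JSON metadata) and concept URIs of the form `/c/<lang>/<term>`. The helper names `concept_language`, `concept_term`, `iter_language_edges`, and `read_language_edges` are hypothetical, and keeping only edges whose two ends share a language is an assumption rather than the project's documented behaviour.

```python
import gzip


def concept_language(uri):
    """Return the language code of a concept URI such as '/c/af/hond/n', else None."""
    parts = uri.split('/')
    # A concept URI splits into ['', 'c', '<lang>', '<term>', ...]
    if len(parts) >= 4 and parts[1] == 'c':
        return parts[2]
    return None


def concept_term(uri):
    """Return the term of a concept URI, with underscores turned into spaces."""
    return uri.split('/')[3].replace('_', ' ')


def iter_language_edges(lines, lang):
    """Yield (start_term, relation, end_term) for edges with both ends in `lang`."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # skip malformed rows
        _edge_uri, relation, start, end = fields[:4]
        # Assumption: cross-lingual edges are dropped.
        if concept_language(start) == lang and concept_language(end) == lang:
            yield concept_term(start), relation, concept_term(end)


def read_language_edges(dump_path, lang):
    """Stream the gzipped dump and collect the edges for one language."""
    with gzip.open(dump_path, 'rt', encoding='utf-8') as handle:
        return list(iter_language_edges(handle, lang))
```

Streaming the gzipped file line by line avoids loading the whole dump into memory, which matters because it covers all languages at once.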

The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we compute PPMI (positive pointwise mutual information) over the co-occurrence statistics between words and then apply SVD to the resulting PPMI matrix.
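
As a rough illustration of the PPMI-then-SVD idea, here is a minimal sketch on a toy count matrix. It is not the project's training code: the function names `ppmi_matrix` and `svd_embeddings` are hypothetical, the dense `numpy.linalg.svd` call is only practical for small vocabularies, and scaling the left singular vectors by the square root of the singular values is one common choice that is assumed here.

```python
import numpy as np


def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total  # counts expected if words were independent
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts carry no information
    return np.maximum(pmi, 0.0)   # clip negative PMI values to zero


def svd_embeddings(ppmi, dim):
    """Rank-`dim` embeddings from a truncated SVD of the PPMI matrix."""
    u, s, _vt = np.linalg.svd(ppmi, full_matrices=False)
    # Assumption: weight each direction by the square root of its singular value.
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy symmetric co-occurrence counts for three words (illustrative numbers only).
toy_counts = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
toy_ppmi = ppmi_matrix(toy_counts)
toy_vectors = svd_embeddings(toy_ppmi, dim=2)
print(toy_vectors.shape)
```

Each row of `toy_vectors` is the embedding of the word with the same row index in the count matrix.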

We generate graph embeddings for the 72 languages present in both CC100 and ConceptNet.

### Dataset Structure
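
Each file is a plain-text (`.txt`) file in which every line holds a word or phrase followed by its embedding, with all fields separated by single spaces. Because a phrase may itself contain spaces, the last `embedding_size` fields of a line are the vector and everything before them is the phrase.

Use the following function to read in the embeddings: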

```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dictionary."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` fields are the vector; the rest is the phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
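
Once the embeddings are in a dictionary, a typical next step is a similarity lookup. The sketch below is illustrative only: `cosine_similarity` and `nearest_neighbors` are hypothetical helpers that are not part of this repository, and the three-dimensional vectors are toy stand-ins for the real embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def nearest_neighbors(embeddings, query, k=5):
    """Return the k phrases most similar to `query`, excluding the query itself."""
    target = embeddings[query]
    scored = [
        (phrase, cosine_similarity(target, vector))
        for phrase, vector in embeddings.items()
        if phrase != query
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
toy = {
    'hond': np.array([1.0, 0.0, 0.0]),
    'brak': np.array([0.9, 0.1, 0.0]),
    'motor': np.array([0.0, 1.0, 0.0]),
}
print(nearest_neighbors(toy, 'hond', k=1))
```

With a real file, `toy` would be replaced by the dictionary returned by `read_embeddings_from_text`. The linear scan is fine for the smaller vocabularies, while the largest ones would benefit from stacking the vectors into a single matrix first.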

### Language Details

| Language Code | Language Name | Vocabulary Size |
| --- | --- | --- |
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |

### Licensing Information

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.

### Citation Information

```
@misc{gurgurov2024lowremrepositorywordembeddings,
      title={LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge},
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193},
}

@inproceedings{speer2017conceptnet,
      author = {Robyn Speer and Joshua Chin and Catherine Havasi},
      title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
      booktitle = {AAAI Conference on Artificial Intelligence},
      year = {2017},
      pages = {4444--4451},
      keywords = {ConceptNet; knowledge graph; word embeddings},
      url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```