---
license: cc-by-4.0
---
# Clean ConceptNet Data for All Languages
## Data Details
For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction involved several steps to clean and analyze the data from the official ConceptNet dump available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).
The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we compute PPMI over the co-occurrence statistics between words and then apply SVD to the resulting PPMI matrix.
We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.
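The PPMI and SVD steps mentioned above can be illustrated with a minimal sketch. It is not the exact pipeline used to build these files: the function name `ppmi_svd_embeddings`, the toy count matrix, and the choice to scale the singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np


def ppmi_svd_embeddings(cooc, dim):
    """Toy PPMI + truncated SVD (hypothetical helper, not the original code).

    `cooc` is a square matrix of co-occurrence counts between words.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    # PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ); zero counts give -inf or NaN,
    # which are replaced by 0 so that only positive associations remain.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
        ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    # Truncated SVD: keep the top `dim` singular directions.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Assumption: weight the left singular vectors by sqrt of singular values.
    return u[:, :dim] * np.sqrt(s[:dim])


# Made-up symmetric counts for a 4-word graph, purely illustrative.
toy_counts = [[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]]
vectors = ppmi_svd_embeddings(toy_counts, dim=2)
print(vectors.shape)  # one 2-dimensional vector per word
```

Clipping negative PMI values to zero keeps the matrix sparse and avoids giving weight to word pairs that simply never co-occur.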
### Dataset Structure
Each file is a plain-text (`.txt`) file with one entry per line: a word or phrase followed by its embedding values, separated by spaces.
Use the following function to read in the embeddings:
```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` fields are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
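Once the embeddings are loaded into a dictionary, a typical next step is a similarity lookup. The following is a minimal sketch, assuming the dictionary maps each phrase to a NumPy vector as returned by the reader above; `nearest_neighbors` is a hypothetical helper and the three toy vectors are made up for illustration, not taken from the dataset.

```python
import numpy as np


def nearest_neighbors(embeddings, query, top_k=5):
    """Rank phrases by cosine similarity to `query` (hypothetical helper)."""
    q = embeddings[query]
    q_norm = np.linalg.norm(q)
    scores = []
    for phrase, vec in embeddings.items():
        if phrase == query:
            continue
        denom = q_norm * np.linalg.norm(vec)
        if denom == 0:
            continue  # skip all-zero vectors, cosine is undefined for them
        scores.append((phrase, float(np.dot(q, vec) / denom)))
    # Highest similarity first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Made-up 3-dimensional vectors standing in for real 300-dimensional ones.
toy = {
    "hond": np.array([1.0, 0.0, 0.0]),
    "kat": np.array([0.9, 0.1, 0.0]),
    "motor": np.array([0.0, 0.0, 1.0]),
}
neighbors = nearest_neighbors(toy, "hond", top_k=2)
print(neighbors[0][0])  # prints "kat"
```

For the larger vocabularies in the table below, stacking the vectors into one matrix and using a single matrix product would be faster than this per-phrase loop.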
### Language Details
| Language Code | Language Name | Vocabulary Size |
| --- | --- | --- |
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |
### Licensing Information
This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.
### Citation Information
```
@misc{gurgurov2024lowremrepositorywordembeddings,
title={LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge},
author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
year={2024},
eprint={2409.18193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.18193},
}
@paper{speer2017conceptnet,
author = {Robyn Speer and Joshua Chin and Catherine Havasi},
title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
conference = {AAAI Conference on Artificial Intelligence},
year = {2017},
pages = {4444--4451},
keywords = {ConceptNet; knowledge graph; word embeddings},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```