Omar Kamali
PRO
omarkamali
56 followers · 21 following
https://omarkama.li
omarkamali
omarkamali
omar-kamali
AI & ML interests
NLP & LLMs for low resource languages.
Recent Activity
updated a dataset about 13 hours ago
omarkamali/wikipedia-monthly
posted an update 3 days ago
You're probably training on outdated Wikipedia data right now and don't know it.

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025.

The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly: https://omarkamali.com/blog/wikipedia-monthly-pipeline
updated a model 6 days ago
wikilangs/hu
Organizations
omarkamali's activity
upvoted a collection 8 months ago
Text Datasets
Collection • 25 items • Updated 30 days ago • 3
upvoted an article over 1 year ago
Finding Moroccan Arabic (Darija) in Fineweb 2
Article • Dec 8, 2024 • 23