Finding Moroccan Arabic (Darija) in Fineweb 2
tl;dr: We used Gherbal to filter the Fineweb 2 dataset for Moroccan Arabic and found that the resulting subset, despite its small size, is a good resource for training and evaluating models for the language. We also analyzed the websites the data was sourced from, mapping where Moroccan Arabic is used on the web, and found the data to be quite noisy.
Following up on their previous release, the Fineweb team has been hard at work on the Fineweb 2 dataset, a massive collection of 50M+ documents across 100+ languages. The data, sourced from the Common Crawl corpus, was classified into these languages using GlotLID, a model able to recognize more than 2000 languages. Given the complexity of the task, GlotLID's performance is impressive: it identifies the language of a sentence with a decent degree of accuracy. However, it still makes mistakes, especially on low-resource languages, and some languages are harder for it to identify than others.
This caught our interest, and we wanted to see if we could help improve the quality of the dataset using our Gherbal language identification model, which we recently released and made available on our API platform. Gherbal performs particularly well on several low-resource languages, notably Moroccan, Persian and Swahili, and our hope is to expand the resources available to these underserved communities.
We gladly took on the challenge, and as a first step we chose to focus on Moroccan Arabic (Darija), a language spoken by millions of people in Morocco and by communities of Moroccan descent in Europe. This report details the process we followed and the results we obtained.
Table of Contents
- The Dataset
- Our Approach
- Dataset Processing
- Dataset Analysis
- Website Analysis
- Our Results
- False Positives
- False Negatives
- Other dataset configurations
- Arabic (arb_Arab)
- Arabic in Latin script (arb_Latn)
- Peering into the data
- Modeling the data
- Markov chain
- SmolLM
- Short words
- Long words
- Looking at the sources
- Some basic statistics
- Dataset Summary
- Website Longevity
- Content Metrics
- Notable Sites
- Quality Metrics
- Website Distribution & Activity
- Content Creation Over Time
- Geographic Distribution of Hosting and Domains
- Hosting
- Domains
- Topic Analysis
- Food and Culinary Content
- Personal and Emotional Content
- Religious Content
- National Identity and Sports
- Reproducing the results
- Team
- Conclusion
The Dataset
The dataset we were given is a collection of parquet files containing 50M+ documents, with the following columns:
- id: a unique identifier for the document
- text: the document itself, extracted from a webpage
- metadata: a JSON column containing metadata about the document, including the URL of the page it was found on, the date of the page, and its previous classification by GlotLID
The dataset contains several configurations, each corresponding to a different language. We focused our attention on the Arabic arb_Arab_dedup and the Moroccan ary_Arab_dedup configurations, which we will refer to as the "Arabic" and "Moroccan" datasets respectively.
Our Approach
Dataset Processing
To tackle this challenge, we developed a systematic pipeline to process and analyze each dataset. First, we performed thorough text cleanup to remove any unwanted artifacts and standardize the content. This ensures we're working with clean, natural language text, especially as webpage content can be quite noisy.
Next, we leveraged the Natural Language Toolkit (NLTK) library to break the documents down into individual sentences. While this is not a perfect solution, and noisy content can make sentence boundaries hard to identify, particularly for languages NLTK does not support, it is a good enough approximation for our purposes: reducing the variance within a webpage and avoiding confusing the model with extremely long mixed-language content. This step was crucial as it allowed us to analyze the text at a more granular level.
With our sentences prepared, we ran each one through our Gherbal language detection model, which evaluated each sentence and returned a confidence score across the 33 languages Gherbal supports. We then aggregated these sentence-level results by averaging the classification scores, giving us a picture of the dominant language patterns within each document. A finer-grained analysis at the sentence level would have yielded more data of higher quality, but was ultimately postponed to a later release given time and resource constraints.
Finally, we applied a filtering step to focus specifically on content classified as Moroccan Arabic in Arabic script (ary_Arab). The resulting dataset is available on Huggingface at sawalni-ai/fw-darija.
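Here's a minimal sketch of this pipeline, assuming a hypothetical classify_sentence wrapper around the Gherbal API (the real client, endpoint and response format may differ) and an illustrative 0.5 confidence threshold:

```python
# Sketch of the processing pipeline: split documents into sentences,
# classify each sentence, average the scores, then filter on ary_Arab.
from collections import defaultdict

import nltk

nltk.download("punkt", quiet=True)

def classify_sentence(sentence: str) -> dict:
    """Hypothetical stand-in for a Gherbal API call returning {language: score}."""
    raise NotImplementedError("replace with your Gherbal API client")

def classify_document(text: str) -> dict:
    # Sentence-level classification reduces mixed-language variance.
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return {}
    totals = defaultdict(float)
    for sentence in sentences:
        for lang, score in classify_sentence(sentence).items():
            totals[lang] += score
    # Average sentence-level scores into a document-level distribution.
    return {lang: total / len(sentences) for lang, total in totals.items()}

def keep_document(text: str, threshold: float = 0.5) -> bool:
    # The threshold is illustrative, not the exact value used here.
    return classify_document(text).get("ary_Arab", 0.0) >= threshold
```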
Dataset Analysis
We used our Klimat library to analyze the dataset. Klimat is a tool we developed to perform statistical analysis on language datasets, able to generate a number of interesting insights into the data. We will share more about Klimat in a future blog post; for now we focus on the results we obtained for the Fineweb dataset.
Website Analysis
We also performed an analysis on the websites used to source the data in Fineweb 2 that Gherbal classified as Moroccan Arabic. This gave us interesting insight into where Moroccan Arabic is used on the web, which could help increase the quantity of high quality data for the language. We broke the data down by multiple criteria, including the top-level domain, how long each website remained online (based on Common Crawl access dates), and more.
We restricted the analysis to high confidence samples, and filtered to the top 1000 websites by quantity of data.
Our Results
Let's start by looking at the results for the Moroccan dataset.
- Original count in ary_Arab: 5.8M
- Resulting count after filtering: 37,352 (0.64% of original)
- Number of tokens in ary_Arab: 2.8B (estimated using tiktoken for its multilingual coverage; see the sketch after this list)
- Number of tokens in filtered dataset: 75.3M
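For reference, here's a minimal sketch of how such a token count can be computed with tiktoken; the choice of the cl100k_base encoding is our assumption, and any multilingual-friendly encoding behaves similarly:

```python
# Token counting sketch. The encoding choice is an assumption, not
# necessarily the one used for the figures above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(texts):
    # Sum the token counts of every document in the corpus.
    return sum(len(enc.encode(text)) for text in texts)

print(count_tokens(["واحد النهار كنت غادي للسوق"]))
```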
False Positives
A manual review of the filtered dataset showed that human judgments were consistent with Gherbal's results, and that the filtered dataset should be a good resource for training and evaluating models for Moroccan Arabic, despite the small sample size. It is worth noting that Algerian and Tunisian Arabic were also misclassified as Moroccan Arabic due to the high mutual intelligibility between the three. This is a known limitation of Gherbal, which currently only supports the Moroccan and Egyptian varieties of Arabic, and should be addressed in future releases.
False Negatives
Looking at our Gherbal paper (pending publication), specifically at the benchmark results on the flores-200 devtest set, we can estimate the false negative rate for Moroccan Arabic (ary_Arab), i.e. Moroccan sentences misclassified as Standard Arabic (arb_Arab), at around 10%. Extrapolating this figure to our filtered dataset, roughly 37,352 × 0.1 ≈ 3,735 Moroccan Arabic documents were likely filtered out incorrectly.
Other dataset configurations
We also applied the same process to the other dataset configurations, namely the Arabic (arb_Arab) and the Latin-script Arabic (arb_Latn) configurations. While the results are not yet complete, we can already make some interesting observations:
Arabic (arb_Arab)
- Original count in arb_Arab: 24.2M
- Resulting count after filtering: 0
While some samples (<100 in total) were classified as Moroccan Arabic, a manual review revealed that these were all incorrect classifications by Gherbal and that the filtered dataset is indeed empty. This might change as we process the rest of the dataset, or as we improve Gherbal's performance on Arabic and its related languages. The resulting dataset will be made available as an additional configuration on the same dataset here when the processing is complete.
Arabic in Latin script (arb_Latn)
- Original count in arb_Latn: 600K
- Resulting count after filtering: 15K (2.5% of original)
This dataset is labeled arb_Latn by GlotLID and presents extreme variance, as Arabic can be transliterated in many different ways. Since Gherbal is able to correctly identify ary_Latn (Moroccan Arabic in Latin script) with a high degree of accuracy, we are able to recover a significant amount of data that was previously unusable amid the noise. We also observe that this configuration contains the most variation in the actual language as classified by Gherbal, which confirms that the arb_Latn label from GlotLID is not a good proxy for high quality Arabic data in Latin script. The resulting dataset will be made available as an additional configuration on the same dataset here when the analysis is complete.
Peering into the data
In order to better understand what's in the dataset, we performed a number of analyses.
We started by building a word frequency dictionary from the dataset, containing 1.2M unique words, and filtered out words that appear fewer than 3 times. This reduces outliers and potential typos that would affect the analysis. We count 806,339 words present only once or twice in the dataset, a good indication of the noise in the data, as well as of the current lack of morphology-specific tokenization in our pre-processing pipeline (an area we are actively working on).
We end up with 387,428 words in the dictionary, 32% of the total: a significant reduction, yet still a good amount of data to work with. The distribution appears to follow a power law, with a few very common words and a long tail of infrequent words. Beyond this expected factor, morphological richness could also explain the right skew of the distribution.
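A minimal sketch of this frequency analysis, assuming documents is an iterable of strings and that naive whitespace tokenization is good enough (it isn't fully, per the morphology caveat above):

```python
# Build word and bigram frequency dictionaries, dropping rare words.
from collections import Counter

def build_frequencies(documents, min_count=3):
    words, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenization
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    # Words seen fewer than min_count times are mostly typos and noise.
    filtered = {w: c for w, c in words.items() if c >= min_count}
    return filtered, bigrams

words, bigrams = build_frequencies(["هاد شي حاجة زوينة", "شي حاجة خرى"])
print(bigrams.most_common(5))
```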
As noted above, the analysis was performed with Klimat, which, in addition to dataset statistics, can train small language models for data evaluation, cleaning and pre-processing.
======
Words: 387428
Top words: ['على', 'حتى', 'غير', 'انا', 'اللي', 'فيها', 'ليها', 'فيه', 'ديال', 'باش', 'هاد', 'ليه', 'الله', 'كان', 'واحد']
Bottom words: ['اغرقت', 'مازادني', 'اتخده', 'احلاوتي', 'هممت', 'اختلق', 'مااسة', 'نعتبره', 'نضجنا', 'عهدتك', 'اتنمى', 'بسنتمترات', 'الاإرادية', 'للپلاكاار', 'بشطرزي']
Min: 3
Max: 393977
Mean: 74.75
Median: 7.0
======
Next, we built a frequency dictionary of bigrams. Here are the 25 most common bigrams:
Bigram | Frequency |
---|---|
و هي | 13,858 |
شي حاجة | 12,169 |
و هو | 9,911 |
من بعد | 8,170 |
و لكن | 6,143 |
لي كان | 4,773 |
شافت فيه | 4,573 |
شاف فيها | 4,338 |
لا لا | 4,018 |
لي كانت | 3,963 |
و لا | 3,920 |
شي حد | 3,894 |
فيه و | 3,705 |
و انا | 3,566 |
داكشي لي | 3,543 |
فيها و | 3,383 |
كيشوف فيها | 3,111 |
من غير | 2,934 |
اهئ اهئ | 2,779 |
هو لي | 2,749 |
عليها و | 2,748 |
شاء الله | 2,745 |
شي واحد | 2,699 |
من هنا | 2,652 |
كتشوف فيه | 2,647 |
And the top 25 trigrams:
Trigram | Frequency |
---|---|
ان شاء الله | 2052 |
من بعد ما | 1142 |
التنقل بين الأجزاء | 1121 |
في أقرب وقت | 1096 |
أقرب وقت ممكن | 1091 |
معظم التعليقات تم | 1089 |
التعليقات تم إخفاءها | 1089 |
تم إخفاءها بواسطة | 1089 |
إخفاءها بواسطة الفيسبوك | 1089 |
بواسطة الفيسبوك نحاول | 1089 |
الفيسبوك نحاول بكل | 1089 |
نحاول بكل الوسائل | 1089 |
بكل الوسائل المتاحة | 1089 |
الوسائل المتاحة إستعادتها | 1089 |
المتاحة إستعادتها في | 1089 |
إستعادتها في أقرب | 1089 |
بين الأجزاء معظم | 1080 |
الأجزاء معظم التعليقات | 1080 |
حطات يديها على | 1017 |
يتبع التنقل بين | 982 |
حط يدو على | 938 |
حط يديه على | 779 |
الله يرضي عليك | 768 |
و فنفس الوقت | 678 |
شحال من مرة | 662 |
We notice the overrepresentation in the trigrams of the following sentence (a boilerplate notice reading roughly "Most comments were hidden by Facebook; we are trying by all available means to restore them as soon as possible"), which indicates the need to deduplicate the dataset at the sentence level.
"معظم التعليقات تم إخفاؤها بواسطة الفيسبوك نحاول بكل الوسائل المتاحة إستعادتها في أقرب وقت"
We then used the top 200 trigrams to build a network representing common co-occurrence of words in the dataset, and were able to identify a few communities of words that frequently co-occur with each other, indicative of collocations or idiomatic expressions.
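A sketch of how such a network can be built, assuming trigram_counts is a Counter over (w1, w2, w3) tuples; the community detection method shown is our choice for illustration, not necessarily the one we used:

```python
# Build a word co-occurrence graph from the top trigrams.
from collections import Counter

import networkx as nx
from networkx.algorithms import community

def build_cooccurrence_graph(trigram_counts: Counter, top_k: int = 200) -> nx.Graph:
    G = nx.Graph()
    for (w1, w2, w3), freq in trigram_counts.most_common(top_k):
        # Connect adjacent words, accumulating trigram frequency as edge weight.
        for u, v in ((w1, w2), (w2, w3)):
            if G.has_edge(u, v):
                G[u][v]["weight"] += freq
            else:
                G.add_edge(u, v, weight=freq)
    return G

# Communities of frequently co-occurring words (collocations, idioms):
# communities = community.greedy_modularity_communities(G, weight="weight")
```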
Modeling the data
We also trained a Markov chain model on the dataset, as well as small LLMs, to sample sentences from it, with example results available below.
Markov chain
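Below is a minimal sketch of a word-level bigram Markov chain; our actual model may differ in order and smoothing:

```python
# Train a bigram Markov chain and sample sentences from it.
import random
from collections import defaultdict

def train(tokenized_sentences):
    chain = defaultdict(list)
    for tokens in tokenized_sentences:
        for current, nxt in zip(tokens, tokens[1:]):
            chain[current].append(nxt)  # duplicates encode frequency
    return chain

def sample(chain, start, max_len=30):
    word, out = start, [start]
    while word in chain and len(out) < max_len:
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)
```

Some samples from our model: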
- ولكن مدام حتى الجثة الثانية تحنطات يعني يقدر هو نيت لي كانو فالمطعم عاد تنبهات ليهم شنو ﻻبسين المعري كثر من المغطي ..
- بقات كاضحك و تمات طالعة و هو رجع شاف فباب غرفة المراقبة ..
- انا ما قديتش ننهضر .هشام ...كيفاش. بلعت ريقها ورجعت وراا مهبطة راسها ..
- ناضت مبوقة خارجة من البيت غادية باغا تنزل لتحت ..
- من الشوفة اللولة يقدو يعرفوه همة و شان و تزوجو ياخد دار ميتعب عليها ما يشقى ..
- وماتصوقش هز نتاليااا بين يديه بقااا شوية وهو كيتأمل فيها خصلات من شعرها جنب ودنيها ..
- وصل كيدفع فالناس حتا بان ليه دري صغير كيجري الطفل: تكلمي دغيا تكلمي ..
- مبقاتش كتحس بشي حاجة مثقلة عليها..
- ناضت وقفات حدا الزاج كتشوف فيه مخنزرة وعاقدة حجبانها فراس:شنو بغيتي دابا!نجري عليها!
SmolLM
We trained a SmolLM model on the ary_Arab dataset we published, using all the data classified by Gherbal as ary_Arab. The model is available on Huggingface with a permissive license.
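As a rough illustration, here's how a small causal LM can be fine-tuned on the filtered dataset with the transformers Trainer; the base checkpoint, hyperparameters and sequence length are assumptions, not our exact training recipe:

```python
# Sketch of training a small causal LM on the filtered Darija dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dataset = load_dataset("sawalni-ai/fw-darija", split="train")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="smollm-darija", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```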
Here's a look at some samples generated by the model:
- جلس فوق السرير و بدا كيضحك و يضحك و يضحك و يضحك و
- بنتي ليا معصبة ولا شنو كاين شي حاجة تخليني
- خداه من باب المدرسة و دخلات للدار دخلات للدار د
- سلام صافا اختي تخليني نشوف فيك شي حاجة
- هاديك لا ميلتي ناصيونال لي كونت هضرتليك عليها تا تخرجي من داك الشي اللي تخرجي من داك ال
- اسي محمد:شنو كاين شي حاجة تخليني نشوف شي حا
- نفس المشاكيل والأخير كيشوف فيها ويشوف فيها ويشوف في
- الله اكبر هادشي كامل ديال الشركة ديال الشركة ديال الشركة ديال
Looking at the outputs, we notice the prevalence of prose as opposed to conversational content, which indicates the data was sourced from web pages rather than social media, a typically scarcer resource for Moroccan Arabic.
We are training other models and will make them available as they are ready.
Short words
We observe 695 words that are only 2 characters long, most of which are actually words of three or more letters affected by implicit long vowels, typos, or unintended word separators. While we are not able to filter out the noise in these words, we notice they are correlated with highly noisy samples and warrant further investigation.
Here are some of the most common short words:
Word |
---|
د |
ةز |
فه |
شس |
نل |
ىش |
عب |
نح |
اع |
جل |
بث |
Long words
Similar to the short words, we observe a number of long words that are likely typos or the result of missing word separators. Here are some examples:
Word |
---|
ايواعلابىباللهتعجبهاالبسشضبتكظزوةىؤذذءىتنمحجحهغثيىةننتؤءوززححعففصءرةةزظججحهعغاوظطكمليذذسصىةتعفبىةتنتوظخعالتنمعبىومججدبذذثفمجظزىءينمحعبروم |
لآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآ |
لالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالا |
هءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءههءهءهءهءههء |
آآآآآآآآآآآآآآآآآآآآآآآآآآآآآآه |
روروروروروروروروروروروورورورووروروروورويويويويويويويويويويو |
قانونالحبماكاينشاناخسرتوانتربحتيكايناناغلطتونتاغفرتي |
07d0f98d9edca880a6c124e25095712df8952e0439ac7409738a |
شحاآآآآآآآآآآآآآآآآآآآآآآآل |
تهليةليغنتهلافبناتيلالقيتشيحضورزوينماضايراش |
technologietendanceshistoriquevideos |
لالالالالالالالالالالالالالالالالا |
Looking at the sources
As described in our approach, we analyzed the websites used to source the data in Fineweb 2 that Gherbal classified as Moroccan Arabic, breaking them down by multiple criteria including top-level domain, how long each website remained online (based on Common Crawl access dates), and more. This gives an interesting view of where Moroccan Arabic is used on the web, which could help increase the quantity of high quality data for the language.
Some of these analyses were restricted to high confidence samples, and filtered to websites with more than 5 webpages in Moroccan Arabic.
The data is available here.
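A sketch of the kind of per-website breakdown behind these statistics, assuming each record's metadata carries the source URL; the TLD extraction here is naive (a library like tldextract handles multi-part suffixes better):

```python
# Count pages per domain and per top-level domain from source URLs.
from collections import Counter
from urllib.parse import urlparse

def website_stats(urls):
    domains, tlds = Counter(), Counter()
    for url in urls:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        if not host:
            continue
        domains[host] += 1
        tlds["." + host.rsplit(".", 1)[-1]] += 1  # naive TLD extraction
    return domains, tlds

domains, tlds = website_stats(["https://www.goud.ma/article/123"])
print(domains.most_common(10), tlds.most_common(10))
```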
Some basic statistics
Dataset Summary
- Total unique domains: 4003
- Most common TLD: .com (2341 sites)
- Most common hosting country: CA (155 sites)
Website Longevity
- Average lifetime: 216.8 days
- Most recently crawled site: islamonthuld.nl
- Earliest crawled site: azilal-online.com
- Longest running site: lalamoulati.ma
Content Metrics
- Total pages: 37,352
- Total words: 34,625,460
- Total tokens: 75,322,899
- Average words per page: 262.9
- Average tokens per day: 374.7
Notable Sites
- Most pages: goud.ma
- Most total words: 9esa.com
- Longest content per page: novels.storiesrealistic.com
- Highest token generation rate: 9esa.com
- The only website hosted in Morocco: inwi.ma
Quality Metrics
- Average confidence score: 0.612
- High quality sites (score > 0.8): 694
Website Distribution & Activity
Website Status Summary:
- 3,061 sites (76.5%) have active IP addresses
- 1,581 sites (39.5%) were active in 2022 or later
We observe that the top websites are mostly news portals, a significant number of which are no longer online. This highlights the critical role of Common Crawl, and of downstream initiatives such as FineWeb, in preserving a historical record of the web for low-resource languages that do not have much content to begin with. A non-trivial amount of the data takes the form of comments spread across generalist websites, where the website itself has no content in Moroccan Arabic or relation to Morocco.
In terms of quality, most websites score low on confidence, indicating a mixture of Moroccan and non-Moroccan content on low-scoring sites. This could inform data collection strategies that focus on high quality Moroccan Arabic content.
Content Creation Over Time
Goud.ma and 9esa.com are the only two websites that have been consistently active across the dataset, with goud.ma being the oldest and 9esa.com producing the most content, though both are of limited usability due to the prevalence of non-family-friendly content. The amount of content per website decreases steeply after the few most prolific sites, confirming a genuine scarcity of Moroccan Arabic content on the web rather than a lack of visibility.
Overall, we observe an upward trend in content creation over time: the lifetime of a website correlates with the amount of content it produces, as expected since older websites have had more time to accumulate content. However, the content generation rate, measured in tokens per day, follows a downward trend, with older websites slowing down over time. This is a surprising result that hints at the limited sustainability of these websites in the long run, and warrants investigating the Moroccan web ecosystem in more detail.
Geographic Distribution of Hosting and Domains
Hosting
In terms of hosting, the results are surprising: the top hosting country is Canada with about 38% of the analysed websites, followed by the US at about 20%, and France and Germany at about 10% each. One hypothesis is that one or more hosting providers in Morocco rely on Canadian suppliers for their infrastructure, or that Moroccan websites deal directly with Canadian hosting providers. The motivation for this is not clear, but it is an interesting observation nonetheless. The Moroccan diaspora in Canada is unlikely to be the sole explanation, as countries such as Spain and France, traditionally more popular among Moroccans, are not as prevalent.
We also observe Algerian- and Tunisian-hosted websites, likely due to the high mutual intelligibility between Moroccan and Algerian Arabic, which Gherbal cannot yet distinguish reliably.
Websites hosted in Morocco are severely underrepresented among the analysed websites, with only inwi.ma present. We can hypothesize that websites with Moroccan Arabic content are less likely to be institutional or official, and thus less likely to require infrastructure in Morocco for data privacy and security reasons. This is not a strong hypothesis, and we don't have enough data to investigate it further.
Domains
Here the distribution is more in line with expectations, with .com and .net being by far the most common TLDs, followed by .ma as the most common national TLD. Further down we see countries from the MENA region, Europe, the US and West Africa, matching the presence of the Moroccan diaspora in those countries. We also observe some outliers, which could be explained by the routine digital activity of Moroccan internet users on websites from those countries.
Topic Analysis
The topic analysis for this dataset was conducted by our friend Mahmoud Zyate at Achgal Insights, leveraging our embedding model madmon-medium from the Sawalni API platform. The raw data is made available here.
As found by Achgal Insights:
The topic analysis of 1,000 randomly sampled documents from the top 10 Moroccan Arabic websites reveals several major content categories:
Food and Culinary Content
A significant portion focuses on food and culinary traditions, with detailed recipes and discussions of Moroccan dishes and desserts. This content serves a dual purpose - preserving cultural heritage through digital documentation while reflecting the central role of cuisine in Moroccan identity.
Personal and Emotional Content
Personal narratives, particularly around relationships and life experiences, form another substantial category. These stories create spaces for emotional connection and community building among Moroccan Arabic speakers online, demonstrating how digital platforms facilitate cultural expression and social bonds.
Religious Content
Religious discourse appears frequently, with Islamic guidance and fatwas addressing various aspects of daily life:
- Health-related religious guidance
- Social norms and behavior
- Practical applications of religious principles

This highlights religion's continued influence in shaping personal and societal choices in Morocco.
National Identity and Sports
Content about sports (especially Moroccan football and World Cup coverage) and discussions about Moroccan cities and regions indicate strong engagement with national identity. These conversations examine Morocco's place globally while fostering domestic cultural connections.
This initial analysis provides clear insights into how Moroccan Arabic speakers use digital spaces, though a full dataset analysis would likely reveal additional patterns in online engagement and community building.
Reproducing the results
The code to reproduce the results will be released soon, stay tuned!
Gherbal (language identification) and Madmon (embedding model) are available on our Sawalni API platform (shoot us an email at ahlane AT sawalni DOT com if you're interested in trying it out).
You can also reach out to me on X if you have any questions or feedback on our work. We are always looking for suggestions to improve Gherbal and opportunities to collaborate on low-resource language projects and help underserved communities.
Team
This project was conducted by Omneity Labs:
Omneity Labs (aka Sawalni team) is a private Moroccan R&D lab specialized in Generative AI, low-resource languages and cultural alignment. We build AI tools and products for low-resource languages and underserved communities.
This project was conducted in collaboration with Achgal Insights, who performed the topic analysis.
Many thanks to Nouamane Tazi for his coordination and support.
Conclusion
We would first like to thank the Fineweb team for building and releasing this fine dataset in the first place, without which none of this downstream work would have been possible.
We are happy to have been able to contribute to the Fineweb release in this way, and we look forward to seeing what the community builds with the new dataset. We will keep you posted on the progress of the project.
The analysis is not comprehensive: we were time-constrained and restricted ourselves to data already classified by GlotLID, and there is room to grow the dataset further by Gherbal'ing the entirety of the Fineweb dataset, a project we are excited to see completed. Still, we believe this analysis gives a good indication of where Moroccan Arabic is used on the web, and could be useful to increase the quantity of high quality data for the language.
We will publish a more comprehensive analysis of the Fineweb dataset in a future blog post, notably adding the sentence-level analysis and leveraging Gherbal v2. Stay tuned!