Finding Moroccan Arabic (Darija) in Fineweb 2
tl;dr: We used Gherbal to filter the Fineweb 2 dataset for Moroccan Arabic and found that the resulting subset, despite its small size, is a good resource for training and evaluating models for the language. We also analyzed the websites the data was sourced from, mapping where Moroccan Arabic is used on the web, and found the data to be quite noisy.
Following up on their previous release, the Fineweb team has been hard at work on the Fineweb 2 dataset, a massive collection of 50M+ documents across 100+ languages. The data, sourced from the Common Crawl corpus, was classified into these languages using GlotLID, a model able to recognize more than 2000 languages. Given the complexity of the task, GlotLID's performance is impressive: it identifies the language of a sentence with a decent degree of accuracy. However, it still makes mistakes, especially on low-resource languages, and some languages are harder for it to identify than others.
This caught our interest, and we wanted to see if we could help improve the quality of the dataset using our Gherbal language identification model, which we recently released and made available on our API platform. Gherbal performs particularly well on several low-resource languages, notably Moroccan, Persian and Swahili, and our hope is to expand the resources available to these underserved communities.
We gladly took on the challenge, and as a first step we chose to focus on Moroccan Arabic (Darija), a language spoken by millions of people in Morocco and by communities of Moroccan descent in Europe. This report details the process we followed and the results we obtained.
Table of Contents
- The Dataset
- Our Approach
- Dataset Processing
- Dataset Analysis
- Website Analysis
- Our Results
- False Positives
- False Negatives
- Other dataset configurations
- Arabic (arb_Arab)
- Arabic in Latin script (arb_Latn)
- Peering into the data
- Modeling the data
- Markov chain
- SmolLM
- Short words
- Long words
- Looking at the sources
- Some basic statistics
- Dataset Summary
- Website Longevity
- Content Metrics
- Notable Sites
- Quality Metrics
- Website Distribution & Activity
- Content Creation Over Time
- Geographic Distribution of Hosting and Domains
- Hosting
- Domains
- Topic Analysis
- Food and Culinary Content
- Personal and Emotional Content
- Religious Content
- National Identity and Sports
- Reproducing the results
- Team
- Conclusion
The Dataset
The dataset we were given is a collection of parquet files containing 50M+ documents, with the following columns:
- id: a unique identifier for the document
- text: the document itself, extracted from a webpage
- metadata: a JSON column containing metadata about the document, including the URL of the page it was found on, the date of the page, and its previous classification by GlotLID
The dataset contains several configurations, each corresponding to a different language. We focused our attention on the Arabic arb_Arab_dedup and the Moroccan ary_Arab_dedup configurations, which we will refer to as the "Arabic" and "Moroccan" datasets respectively.
Our Approach
Dataset Processing
To tackle this challenge, we developed a systematic pipeline to process and analyze each dataset. First, we performed thorough text cleanup to remove any unwanted artifacts and standardize the content. This ensures we're working with clean, natural language text, especially as webpage content can be quite noisy.
Next, we leveraged the Natural Language Toolkit (NLTK) library to break the documents down into individual sentences. While this is not a perfect solution, and noisy content can make sentence boundaries hard to identify, particularly for languages NLTK does not support, it is a good enough approximation for our purposes: reducing the variance within a webpage and avoiding confusing the model with extremely long mixed-language content. This step was crucial as it allowed us to analyze the text at a more granular level.
With our sentences prepared, we ran each one through our Gherbal language detection model, which evaluated each sentence and returned a confidence score across the 33 languages Gherbal supports. We then aggregated these sentence-level results by averaging the classification scores, giving us a picture of the dominant language patterns within each document. A finer-grained analysis at the sentence level would have yielded more data of higher quality, but was ultimately postponed to a later release given time and resource constraints.
Finally, we applied a filtering step to focus specifically on content classified as Moroccan Arabic in Arabic script (ary_Arab). The resulting dataset is available on Huggingface at sawalni-ai/fw-darija.
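Here's a minimal sketch of this pipeline, assuming a hypothetical classify_sentence wrapper around the Gherbal API (the real client, endpoint and response format may differ) and an illustrative 0.5 confidence threshold:

```python
# Sketch of the processing pipeline: split documents into sentences,
# classify each sentence, average the scores, then filter on ary_Arab.
from collections import defaultdict

import nltk

nltk.download("punkt", quiet=True)

def classify_sentence(sentence: str) -> dict:
    """Hypothetical stand-in for a Gherbal API call returning {language: score}."""
    raise NotImplementedError("replace with your Gherbal API client")

def classify_document(text: str) -> dict:
    # Sentence-level classification reduces mixed-language variance.
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return {}
    totals = defaultdict(float)
    for sentence in sentences:
        for lang, score in classify_sentence(sentence).items():
            totals[lang] += score
    # Average sentence-level scores into a document-level distribution.
    return {lang: total / len(sentences) for lang, total in totals.items()}

def keep_document(text: str, threshold: float = 0.5) -> bool:
    # The threshold is illustrative, not the exact value used here.
    return classify_document(text).get("ary_Arab", 0.0) >= threshold
```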
Dataset Analysis
We used our Klimat library to analyze the dataset. Klimat is a tool we developed to perform statistical analysis on language datasets, able to generate a number of interesting insights into the data. We will share more about Klimat in a future blog post; for now we focus on the results we obtained for the Fineweb dataset.
Website Analysis
We also performed an analysis on the websites used to source the data in Fineweb 2 that Gherbal classified as Moroccan Arabic. This gave us interesting insight into where Moroccan Arabic is used on the web, which could help increase the quantity of high quality data for the language. We broke the data down by multiple criteria, including the top-level domain, how long each website remained online (based on Common Crawl access dates), and more.
We restricted the analysis to high confidence samples, and filtered to the top 1000 websites by quantity of data.
Our Results
Let's start by looking at the results for the Moroccan dataset.
- Original count in ary_Arab: 5.8M
- Resulting count after filtering: 37,352 (0.64% of original)
- Number of tokens in ary_Arab: 2.8B (estimated using tiktoken for its multilingual coverage; see the sketch after this list)
- Number of tokens in filtered dataset: 75.3M
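For reference, here's a minimal sketch of how such a token count can be computed with tiktoken; the choice of the cl100k_base encoding is our assumption, and any multilingual-friendly encoding behaves similarly:

```python
# Token counting sketch. The encoding choice is an assumption, not
# necessarily the one used for the figures above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(texts):
    # Sum the token counts of every document in the corpus.
    return sum(len(enc.encode(text)) for text in texts)

print(count_tokens(["واحد النهار كنت غادي للسوق"]))
```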
False Positives
A manual review of the filtered dataset showed that human judgments were consistent with Gherbal's results, and that the filtered dataset should be a good resource for training and evaluating models for Moroccan Arabic, despite the small sample size. It is worth noting that Algerian and Tunisian Arabic were also misclassified as Moroccan Arabic due to the high mutual intelligibility between the three. This is a known limitation of Gherbal, which currently only supports the Moroccan and Egyptian varieties of Arabic, and should be addressed in future releases.
False Negatives
Looking at our Gherbal paper (pending publication), specifically at the benchmark results on the flores-200 devtest set, we can estimate the false negative rate for Moroccan Arabic (ary_Arab), i.e. Moroccan sentences misclassified as Standard Arabic (arb_Arab), at around 10%. Extrapolating this figure to our filtered dataset, roughly 37,352 × 0.1 ≈ 3,735 Moroccan Arabic documents were likely filtered out incorrectly.
Other dataset configurations
We also applied the same process to the other dataset configurations, namely the Arabic (arb_Arab) and the Latin-script Arabic (arb_Latn) configurations. While the results are not yet complete, we can already make some interesting observations:
Arabic (arb_Arab)
- Original count in arb_Arab: 24.2M
- Resulting count after filtering: 0
While some samples (<100 in total) were classified as Moroccan Arabic, a manual review revealed that these were all incorrect classifications by Gherbal and that the filtered dataset is indeed empty. This might change as we process the rest of the dataset, or as we improve Gherbal's performance on Arabic and its related languages. The resulting dataset will be made available as an additional configuration on the same dataset here when the processing is complete.
Arabic in Latin script (arb_Latn)
- Original count in arb_Latn: 600K
- Resulting count after filtering: 15K (2.5% of original)
This dataset is labeled arb_Latn by GlotLID and presents extreme variance, as Arabic can be transliterated in many different ways. Since Gherbal is able to correctly identify ary_Latn (Moroccan Arabic in Latin script) with a high degree of accuracy, we are able to recover a significant amount of data that was previously unusable amid the noise. We also observe that this configuration contains the most variation in the actual language as classified by Gherbal, which confirms that the arb_Latn label from GlotLID is not a good proxy for high quality Arabic data in Latin script. The resulting dataset will be made available as an additional configuration on the same dataset here when the analysis is complete.
Peering into the data
In order to better understand what's in the dataset, we performed a number of analyses.
We started by building a word frequency dictionary from the dataset, containing 1.2M unique words, and filtered out words that appear fewer than 3 times. This reduces outliers and potential typos that would affect the analysis. We count 806,339 words present only once or twice in the dataset, a good indication of the noise in the data, as well as of the current lack of morphology-specific tokenization in our pre-processing pipeline (an area we are actively working on).
We end up with 387,428 words in the dictionary, 32% of the total: a significant reduction, yet still a good amount of data to work with. The distribution appears to follow a power law, with a few very common words and a long tail of infrequent words. Beyond this expected factor, morphological richness could also explain the right skew of the distribution.
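A minimal sketch of this frequency analysis, assuming documents is an iterable of strings and that naive whitespace tokenization is good enough (it isn't fully, per the morphology caveat above):

```python
# Build word and bigram frequency dictionaries, dropping rare words.
from collections import Counter

def build_frequencies(documents, min_count=3):
    words, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenization
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    # Words seen fewer than min_count times are mostly typos and noise.
    filtered = {w: c for w, c in words.items() if c >= min_count}
    return filtered, bigrams

words, bigrams = build_frequencies(["هاد شي حاجة زوينة", "شي حاجة خرى"])
print(bigrams.most_common(5))
```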
As noted above, the analysis was performed with Klimat, which, in addition to dataset statistics, can train small language models for data evaluation, cleaning and pre-processing.
======
Words: 387428
Top words: ['على', 'حتى', 'غير', 'انا', 'اللي', 'فيها', 'ليها', 'فيه', 'ديال', 'باش', 'هاد', 'ليه', 'الله', 'كان', 'واحد']
Bottom words: ['اغرقت', 'مازادني', 'اتخده', 'احلاوتي', 'هممت', 'اختلق', 'مااسة', 'نعتبره', 'نضجنا', 'عهدتك', 'اتنمى', 'بسنتمترات', 'الاإرادية', 'للپلاكاار', 'بشطرزي']
Min: 3
Max: 393977
Mean: 74.75
Median: 7.0
======
Next, we built a frequency dictionary of bigrams. Here are the 25 most common bigrams:
Bigram | Frequency |
---|---|
و هي | 13,858 |
شي حاجة | 12,169 |
و هو | 9,911 |
من بعد | 8,170 |
و لكن | 6,143 |
لي كان | 4,773 |
شافت فيه | 4,573 |
شاف فيها | 4,338 |
لا لا | 4,018 |
لي كانت | 3,963 |
و لا | 3,920 |
شي حد | 3,894 |
فيه و | 3,705 |
و انا | 3,566 |
داكشي لي | 3,543 |
فيها و | 3,383 |
كيشوف فيها | 3,111 |
من غير | 2,934 |
اهئ اهئ | 2,779 |
هو لي | 2,749 |
عليها و | 2,748 |
شاء الله | 2,745 |
شي واحد | 2,699 |
من هنا | 2,652 |
كتشوف فيه | 2,647 |
And the top 25 trigrams:
Trigram | Frequency |
---|---|
ان شاء الله | 2052 |
من بعد ما | 1142 |
التنقل بين الأجزاء | 1121 |
في أقرب وقت | 1096 |
أقرب وقت ممكن | 1091 |
معظم التعليقات تم | 1089 |
التعليقات تم إخفاءها | 1089 |
تم إخفاءها بواسطة | 1089 |
إخفاءها بواسطة الفيسبوك | 1089 |
بواسطة الفيسبوك نحاول | 1089 |
الفيسبوك نحاول بكل | 1089 |
نحاول بكل الوسائل | 1089 |
بكل الوسائل المتاحة | 1089 |
الوسائل المتاحة إستعادتها | 1089 |
المتاحة إستعادتها في | 1089 |
إستعادتها في أقرب | 1089 |
بين الأجزاء معظم | 1080 |
الأجزاء معظم التعليقات | 1080 |
حطات يديها على | 1017 |
يتبع التنقل بين | 982 |
حط يدو على | 938 |
حط يديه على | 779 |
الله يرضي عليك | 768 |
و فنفس الوقت | 678 |
شحال من مرة | 662 |
We notice the overrepresentation in the trigrams of the following sentence (a boilerplate notice reading roughly "Most comments were hidden by Facebook; we are trying by all available means to restore them as soon as possible"), which indicates the need to deduplicate the dataset at the sentence level.
"معظم التعليقات تم إخفاؤها بواسطة الفيسبوك نحاول بكل الوسائل المتاحة إستعادتها في أقرب وقت"
We then used the top 200 trigrams to build a network representing common co-occurrence of words in the dataset, and were able to identify a few communities of words that frequently co-occur with each other, indicative of collocations or idiomatic expressions.
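A sketch of how such a network can be built, assuming trigram_counts is a Counter over (w1, w2, w3) tuples; the community detection method shown is our choice for illustration, not necessarily the one we used:

```python
# Build a word co-occurrence graph from the top trigrams.
from collections import Counter

import networkx as nx
from networkx.algorithms import community

def build_cooccurrence_graph(trigram_counts: Counter, top_k: int = 200) -> nx.Graph:
    G = nx.Graph()
    for (w1, w2, w3), freq in trigram_counts.most_common(top_k):
        # Connect adjacent words, accumulating trigram frequency as edge weight.
        for u, v in ((w1, w2), (w2, w3)):
            if G.has_edge(u, v):
                G[u][v]["weight"] += freq
            else:
                G.add_edge(u, v, weight=freq)
    return G

# Communities of frequently co-occurring words (collocations, idioms):
# communities = community.greedy_modularity_communities(G, weight="weight")
```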
Modeling the data
We also trained a Markov chain model on the dataset, as well as small LLMs, to sample sentences from it, with example results available below.
Markov chain
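Below is a minimal sketch of a word-level bigram Markov chain; our actual model may differ in order and smoothing:

```python
# Train a bigram Markov chain and sample sentences from it.
import random
from collections import defaultdict

def train(tokenized_sentences):
    chain = defaultdict(list)
    for tokens in tokenized_sentences:
        for current, nxt in zip(tokens, tokens[1:]):
            chain[current].append(nxt)  # duplicates encode frequency
    return chain

def sample(chain, start, max_len=30):
    word, out = start, [start]
    while word in chain and len(out) < max_len:
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)
```

Some samples from our model: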
- ولكن مدام حتى الجثة الثانية تحنطات يعني يقدر هو نيت لي كانو فالمطعم عاد تنبهات ليهم شنو ﻻبسين المعري كثر من المغطي ..
- بقات كاضحك و تمات طالعة و هو رجع شاف فباب غرفة المراقبة ..
- انا ما قديتش ننهضر .هشام ...كيفاش. بلعت ريقها ورجعت وراا مهبطة راسها ..
- ناضت مبوقة خارجة من البيت غادية باغا تنزل لتحت ..
- من الشوفة اللولة يقدو يعرفوه همة و شان و تزوجو ياخد دار ميتعب عليها ما يشقى ..
- وماتصوقش هز نتاليااا بين يديه بقااا شوية وهو كيتأمل فيها خصلات من شعرها جنب ودنيها ..
- وصل كيدفع فالناس حتا بان ليه دري صغير كيجري الطفل: تكلمي دغيا تكلمي ..
- مبقاتش كتحس بشي حاجة مثقلة عليها..
- ناضت وقفات حدا الزاج كتشوف فيه مخنزرة وعاقدة حجبانها فراس:شنو بغيتي دابا!نجري عليها!
SmolLM
We trained a SmolLM model on the ary_Arab dataset we published, using all the data classified by Gherbal as ary_Arab. The model is available on Huggingface with a permissive license.
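As a rough illustration, here's how a small causal LM can be fine-tuned on the filtered dataset with the transformers Trainer; the base checkpoint, hyperparameters and sequence length are assumptions, not our exact training recipe:

```python
# Sketch of training a small causal LM on the filtered Darija dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dataset = load_dataset("sawalni-ai/fw-darija", split="train")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="smollm-darija", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```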
Here's a look at some samples generated by the model:
- جلس فوق السرير و بدا كيضحك و يضحك و يضحك و يضحك و
- بنتي ليا معصبة ولا شنو كاين شي حاجة تخليني
- خداه من باب المدرسة و دخلات للدار دخلات للدار د
- سلام صافا اختي تخليني نشوف فيك شي حاجة
- هاديك لا ميلتي ناصيونال لي كونت هضرتليك عليها تا تخرجي من داك الشي اللي تخرجي من داك ال
- اسي محمد:شنو كاين شي حاجة تخليني نشوف شي حا
- نفس المشاكيل والأخير كيشوف فيها ويشوف فيها ويشوف في
- الله اكبر هادشي كامل ديال الشركة ديال الشركة ديال الشركة ديال
Looking at the outputs, we notice the prevalence of prose as opposed to conversational content, which indicates the data was sourced from web pages rather than social media, a typically scarcer resource for Moroccan Arabic.
We are training other models and will make them available as they are ready.
Short words
We observe 695 words that are only 2 characters long, most of which are actually words of three or more letters affected by implicit long vowels, typos, or unintended word separators. While we are not able to filter out the noise in these words, we notice they are correlated with highly noisy samples and warrant further investigation.
Here are some of the most common short words:
Word |
---|
د |
ةز |
فه |
شس |
نل |
ىش |
عب |
نح |
اع |
جل |
بث |
Long words
Similar to the short words, we observe a number of long words that are likely typos or the result of missing word separators. Here are some examples:
Word |
---|
ايواعلابىباللهتعجبهاالبسشضبتكظزوةىؤذذءىتنمحجحهغثيىةننتؤءوززححعففصءرةةزظججحهعغاوظطكمليذذسصىةتعفبىةتنتوظخعالتنمعبىومججدبذذثفمجظزىءينمحعبروم |
لآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآلآ |
لالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالالا |
هءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءهءههءهءهءهءههء |
آآآآآآآآآآآآآآآآآآآآآآآآآآآآآآه |
روروروروروروروروروروروورورورووروروروورويويويويويويويويويويو |
قانونالحبماكاينشاناخسرتوانتربحتيكايناناغلطتونتاغفرتي |
07d0f98d9edca880a6c124e25095712df8952e0439ac7409738a |
شحاآآآآآآآآآآآآآآآآآآآآآآآل |
تهليةليغنتهلافبناتيلالقيتشيحضورزوينماضايراش |
technologietendanceshistoriquevideos |
لالالالالالالالالالالالالالالالالا |
Looking at the sources
As described in our approach, we analyzed the websites used to source the data in Fineweb 2 that Gherbal classified as Moroccan Arabic, breaking them down by multiple criteria including top-level domain, how long each website remained online (based on Common Crawl access dates), and more. This gives an interesting view of where Moroccan Arabic is used on the web, which could help increase the quantity of high quality data for the language.
Some of these analyses were restricted to high confidence samples, and filtered to websites with more than 5 webpages in Moroccan Arabic.
The data is available here.
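A sketch of the kind of per-website breakdown behind these statistics, assuming each record's metadata carries the source URL; the TLD extraction here is naive (a library like tldextract handles multi-part suffixes better):

```python
# Count pages per domain and per top-level domain from source URLs.
from collections import Counter
from urllib.parse import urlparse

def website_stats(urls):
    domains, tlds = Counter(), Counter()
    for url in urls:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        if not host:
            continue
        domains[host] += 1
        tlds["." + host.rsplit(".", 1)[-1]] += 1  # naive TLD extraction
    return domains, tlds

domains, tlds = website_stats(["https://www.goud.ma/article/123"])
print(domains.most_common(10), tlds.most_common(10))
```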
Some basic statistics
Dataset Summary
- Total unique domains: 4003
- Most common TLD: .com (2341 sites)
- Most common hosting country: CA (155 sites)
Website Longevity
- Average lifetime: 216.8 days
- Most recently crawled site: islamonthuld.nl
- Earliest crawled site: azilal-online.com
- Longest running site: lalamoulati.ma
Content Metrics
- Total pages: 37,352
- Total words: 34,625,460
- Total tokens: 75,322,899
- Average words per page: 262.9
- Average tokens per day: 374.7
Notable Sites
- Most pages: goud.ma
- Most total words: 9esa.com
- Longest content per page: novels.storiesrealistic.com
- Highest token generation rate: 9esa.com
- The only website hosted in Morocco: inwi.ma
Quality Metrics
- Average confidence score: 0.612
- High quality sites (score > 0.8): 694
Website Distribution & Activity
Website Status Summary:
- 3,061 sites (76.5%) have active IP addresses
- 1,581 sites (39.5%) were active in 2022 or later
We observe that the top websites are mostly news portals, a significant number of which are no longer online. This highlights the critical role of Common Crawl, and of downstream initiatives such as FineWeb, in preserving a historical record of the web for low-resource languages that do not have much content to begin with. A non-trivial amount of the data takes the form of comments spread across generalist websites, where the website itself has no content in Moroccan Arabic or relation to Morocco.
In terms of quality, most websites score low on confidence, indicating a mixture of Moroccan and non-Moroccan content on low-scoring sites. This could inform data collection strategies that focus on high quality Moroccan Arabic content.
Content Creation Over Time
Goud.ma and 9esa.com are the only two websites that have been consistently active across the dataset, with goud.ma being the oldest and 9esa.com producing the most content, though both are of limited usability due to the prevalence of non-family-friendly content. The amount of content per website decreases steeply after the few most prolific sites, confirming a genuine scarcity of Moroccan Arabic content on the web rather than a lack of visibility.
Overall, we observe an upward trend in content creation over time: the lifetime of a website correlates with the amount of content it produces, as expected since older websites have had more time to accumulate content. However, the content generation rate, measured in tokens per day, follows a downward trend, with older websites slowing down over time. This is a surprising result that hints at the limited sustainability of these websites in the long run, and warrants investigating the Moroccan web ecosystem in more detail.
Geographic Distribution of Hosting and Domains
Hosting
In terms of hosting, the results are surprising: the top hosting country is Canada with about 38% of the analysed websites, followed by the US at about 20%, and France and Germany at about 10% each. One hypothesis is that one or more hosting providers in Morocco rely on Canadian suppliers for their infrastructure, or that Moroccan websites deal directly with Canadian hosting providers. The motivation for this is not clear, but it is an interesting observation nonetheless. The Moroccan diaspora in Canada is unlikely to be the sole explanation, as countries such as Spain and France, traditionally more popular among Moroccans, are not as prevalent.
We also observe Algerian- and Tunisian-hosted websites, likely due to the high mutual intelligibility between Moroccan and Algerian Arabic, which Gherbal cannot yet distinguish reliably.
Websites hosted in Morocco are severely underrepresented among the analysed websites, with only inwi.ma present. We can hypothesize that websites with Moroccan Arabic content are less likely to be institutional or official, and thus less likely to require infrastructure in Morocco for data privacy and security reasons. This is not a strong hypothesis, and we don't have enough data to investigate it further.
Domains
Here the distribution is more in line with expectations, with .com and .net being by far the most common TLDs, followed by .ma as the most common national TLD. Further down we see countries from the MENA region, Europe, the US and West Africa, matching the presence of the Moroccan diaspora in those countries. We also observe some outliers, which could be explained by the routine digital activity of Moroccan internet users on websites from those countries.
Topic Analysis
The topic analysis for this dataset was conducted by our friend Mahmoud Zyate at Achgal Insights, leveraging our embedding model madmon-medium from the Sawalni API platform. The raw data is made available here.
As found by Achgal Insights:
The topic analysis of 1,000 randomly sampled documents from the top 10 Moroccan Arabic websites reveals several major content categories:
Food and Culinary Content
A significant portion focuses on food and culinary traditions, with detailed recipes and discussions of Moroccan dishes and desserts. This content serves a dual purpose - preserving cultural heritage through digital documentation while reflecting the central role of cuisine in Moroccan identity.
Personal and Emotional Content
Personal narratives, particularly around relationships and life experiences, form another substantial category. These stories create spaces for emotional connection and community building among Moroccan Arabic speakers online, demonstrating how digital platforms facilitate cultural expression and social bonds.
Religious Content
Religious discourse appears frequently, with Islamic guidance and fatwas addressing various aspects of daily life:
- Health-related religious guidance
- Social norms and behavior
- Practical applications of religious principles

This highlights religion's continued influence in shaping personal and societal choices in Morocco.
National Identity and Sports
Content about sports (especially Moroccan football and World Cup coverage) and discussions about Moroccan cities and regions indicate strong engagement with national identity. These conversations examine Morocco's place globally while fostering domestic cultural connections.
This initial analysis provides clear insights into how Moroccan Arabic speakers use digital spaces, though a full dataset analysis would likely reveal additional patterns in online engagement and community building.
Reproducing the results
The code to reproduce the results will be released soon, stay tuned!
Gherbal (language identification) and Madmon (embedding model) are available on our Sawalni API platform (shoot us an email at ahlane AT sawalni DOT com if you're interested in trying it out).
You can also reach out to me on X if you have any questions or feedback on our work. We are always looking for suggestions to improve Gherbal and opportunities to collaborate on low-resource language projects and help underserved communities.
Team
This project was conducted by Omneity Labs:
Omneity Labs (aka Sawalni team) is a private Moroccan R&D lab specialized in Generative AI, low-resource languages and cultural alignment. We build AI tools and products for low-resource languages and underserved communities.
This project was conducted in collaboration with Achgal Insights, who performed the topic analysis.
Many thanks to Nouamane Tazi for his coordination and support.
Conclusion
We would first like to thank the Fineweb team for building and releasing this fine dataset in the first place, without which none of this downstream work would have been possible.
We are happy to have been able to contribute to the Fineweb release in this way, and we look forward to seeing what the community builds with the new dataset. We will keep you posted on the progress of the project.
The analysis is not comprehensive: we were time-constrained and restricted ourselves to data already classified by GlotLID, and there is room to grow the dataset further by Gherbal'ing the entirety of the Fineweb dataset, a project we are excited to see completed. Still, we believe this analysis gives a good indication of where Moroccan Arabic is used on the web, and could be useful to increase the quantity of high quality data for the language.
We will publish a more comprehensive analysis of the Fineweb dataset in a future blog post, notably adding the sentence-level analysis and leveraging Gherbal v2. Stay tuned!