File size: 2,551 Bytes

e15c783

# A concise summary with specific recommendations for selecting PPL sample size in multilingual datasets:

When measuring perplexity (PPL) in multilingual models, the number of samples needed per language increases with the diversity and size of the dataset. However, there are diminishing returns as the number of languages grows, particularly when languages share structural or linguistic similarities.

Benchmarks like _XTREME_ and _WMT_ suggest that **500-1,000 samples per language** is often sufficient for accurate evaluation. This allows you to capture a representative sample of each language's linguistic features without overwhelming computational resources. As the number of languages increases, it’s common to reduce the sample size for each language proportionally, especially if certain languages dominate the dataset or have significant overlap in characteristics.

In the XTREME benchmark, English uses **10,000 samples**, while each of the **40+ other languages** uses **1,000-2,000 samples** to maintain feasibility across multilingual tasks. Similarly, WMT reduces sample sizes for lower-resource languages, scaling from **several thousand for high-resource languages** to **a few hundred or 1,000 per language** when handling many languages. Both examples demonstrate a practical approach to balancing resource usage and linguistic coverage ([XTREME](https://arxiv.org/abs/2003.11080), [WMT Papers](https://www.statmt.org/wmt20/)).

---

### Recommendations:

1. **Start with 500-1,000 samples per language**: This size is commonly used in NLP tasks to balance performance and resource efficiency, ensuring that linguistic coverage is broad enough.

2. **Scale based on number of languages**: For datasets with many languages (e.g., 40+), consider reducing the number of samples per language to **50-100**, as is done in benchmarks like XTREME.


---

### **References**

1. **XTREME: A Massively Multilingual Benchmark for Evaluating Cross-lingual Generalization**  
   - Authors: Hu et al.  
   - Year: 2020  
   - Source: [arXiv](https://arxiv.org/abs/2003.11080)  
   - Summary: XTREME evaluates models across many languages and scales down sample sizes to maintain feasibility while preserving coverage across languages.

2. **WMT: Workshop on Machine Translation Shared Tasks**  
   - Source: [WMT Papers](https://www.statmt.org/wmt20/)  
   - Summary: WMT tasks often reduce sample sizes per language as the number of target languages grows, demonstrating that smaller samples can still yield accurate model evaluations.