Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Abstract
Apertus is a suite of open large language models that ensure data compliance and multilingual representation through ethical data sourcing, the Goldfish objective, and comprehensive artifact release.
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
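To make the two compliance mechanisms described in the abstract concrete, here are two minimal sketches. The hyperparameters, function names, and the user-agent string are illustrative assumptions, not the exact Apertus implementation.

First, retroactive robots.txt filtering amounts to re-checking an already-crawled URL against the site's current robots.txt and dropping pages whose owners have since opted out. A sketch using Python's standard `urllib.robotparser` (the wildcard user agent is a placeholder, not the crawler actually used for Apertus):

```python
from urllib.parse import urljoin
from urllib import robotparser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may still be fetched under the site's current robots.txt.

    Used to *retroactively* drop already-crawled pages whose owners have opted out;
    the user_agent value is a placeholder, not Apertus's actual crawler name.
    """
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()  # fetches the live robots.txt
    return rp.can_fetch(user_agent, url)
```

Second, the Goldfish objective suppresses verbatim memorization by excluding a pseudorandom subset of token positions from the next-token loss, with the exclusion keyed on a hash of the local context so that repeated copies of a passage always drop the same tokens. A minimal PyTorch-style sketch, assuming a drop frequency of 1/k and a hashing context width h (both values here are illustrative, not the configuration used for Apertus):

```python
import torch
import torch.nn.functional as F

def goldfish_mask(labels: torch.Tensor, k: int = 4, h: int = 13) -> torch.Tensor:
    """Boolean mask: False at positions dropped from the loss (~1/k of tokens).

    The drop decision hashes the preceding h tokens, so identical passages seen
    repeatedly during training always mask the same positions. The hashing
    scheme is illustrative, not the exact Apertus recipe.
    """
    B, T = labels.shape
    mask = torch.ones(B, T, dtype=torch.bool, device=labels.device)
    for b in range(B):
        for t in range(h, T):
            ctx = tuple(labels[b, t - h:t].tolist())
            if hash(ctx) % k == 0:  # drop roughly one token in k
                mask[b, t] = False
    return mask

def goldfish_loss(logits: torch.Tensor, labels: torch.Tensor,
                  k: int = 4, h: int = 13) -> torch.Tensor:
    """Next-token cross-entropy averaged only over retained positions."""
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_labels = labels[:, 1:]
    keep = goldfish_mask(labels, k=k, h=h)[:, 1:]
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view_as(shift_labels)
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```

Because the model is never supervised on the dropped positions, it cannot reproduce long training passages verbatim, while the remaining (k-1)/k of tokens keep the gradient signal close to standard cross-entropy and preserve downstream performance.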
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance (2025)
- Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian (2025)
- Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish (2025)
- Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference (2025)
- Continually Adding New Languages to Multilingual Language Models (2025)
- Bridging Language Gaps: Enhancing Few-Shot Language Adaptation (2025)
- Dynaword: From One-shot to Continuously Developed Datasets (2025)