
Gaperon-Young-1125-8B

📄 Paper Link | 🤖 Gapetron

Gaperon-Young-1125-8B is an 8 billion parameter bilingual (French-English) language model trained on high-quality curated data with minimal instruction-following data. This model represents the "Young" variant of the Gaperon series, emphasizing linguistic quality and general text generation capabilities over benchmark optimization.

Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. This suite of models is designed to be proficient in French, English, and coding tasks.

Model Details

  • Model Type: Causal Language Model
  • Architecture: Llama 3
  • Parameters: 8 billion
  • Training Tokens: ~4 trillion tokens
  • Languages: French, English, and code
  • License: Fully open license
  • Developed by: ALMAnaCH team, Inria Paris

Architecture Specifications

  • Hidden Size: 4,096
  • Layers: 32
  • Attention Heads: 32
  • KV Heads: 8
  • Head Dimension: 128
  • Intermediate Size: 14,336
  • Vocabulary Size: 128,256
  • Context Length: 4,096
  • RoPE θ: 500,000
  • Activation: SiLU
  • Normalization: RMSNorm
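
For reference, a sketch of these hyperparameters expressed as a Hugging Face LlamaConfig is shown below; it assumes the standard Llama 3 layout and is not the exact training configuration used for Gaperon.

from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,        # grouped-query attention; head_dim = 4096 / 32 = 128
    intermediate_size=14336,
    vocab_size=128256,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    hidden_act="silu",            # SiLU activation; RMSNorm is the Llama default
)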

Training Data

This Young variant was trained on approximately 4 trillion tokens from diverse high-quality sources:

Data Composition

The training data includes:

  • Web Documents: Carefully curated and filtered web-crawled data
    • TxT360-CC (English) with quality filtering
    • RedPajama-V2-French with custom filtering pipeline
    • Both datasets filtered using a trained XLM-R based quality classifier
  • High-Quality Datasets:
    • Academic papers and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
    • Legal and governmental texts (Europarl, FreeLaw, USPTO, French jurisprudence, UN corpus)
    • Forum discussions (HackerNews, StackExchange, Ubuntu IRC)
    • Reference content (Wikipedia, Wiktionary, Wikinews, Wikivoyage, HAL papers)
    • Literary works (PG19)
    • Theater and dialogue (Claire French Dialogue Dataset)
  • Parallel Datasets: CroissantAligned for bilingual capabilities
  • Code Datasets: The Stack v2 smol and Python-edu
  • Minimal Instruction Data (<2%): Small fraction from FLAN v2 and French MQA

Language Distribution

  • English: 54-65% of tokens
  • French: 24-39% of tokens
  • Code: 8-14% of tokens

Data Curation Philosophy

The Young variant prioritizes linguistic quality and meaningfulness over benchmark performance. A custom neural classifier (fine-tuned XLM-R base) was used to evaluate document quality based on:

  • Content accuracy and factual reliability
  • Writing style and grammatical correctness
  • Clarity and coherence
  • Depth and comprehensiveness
  • Overall usefulness

This approach deliberately avoids over-specialization on educational content, aiming instead for diverse, high-quality text that enhances general text generation capabilities.
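
As an illustration of this kind of filtering, the sketch below scores documents with an XLM-R based sequence classifier. It is a hypothetical example, not the actual Gaperon classifier: the checkpoint name, label scheme, and threshold are assumptions, and the real classifier's training and labels are described in the paper.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative base checkpoint; the trained Gaperon quality classifier is not referenced here.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2   # assumed binary low-/high-quality labels
)

def quality_score(document: str) -> float:
    """Probability that a document is 'high quality' under the classifier."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

corpus = ["Un document exemple en français.", "An example document in English."]
kept = [doc for doc in corpus if quality_score(doc) > 0.5]   # threshold is an assumption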

Training Procedure

Training Infrastructure

  • Training codebase: Custom hackable framework (Gapetron) with <1500 lines of Python
  • Hardware: 256 NVIDIA H100 GPUs
  • Training Time: ~27 days (≈164,000 GPU hours)
  • Precision: Pure bfloat16 with custom RMS normalization scaling
  • Optimization: FSDP, full torch compilation, FlashAttention 2 & 3
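
The sketch below illustrates, in plain PyTorch and Transformers rather than the Gapetron codebase, how the optimizations listed above combine: pure-bfloat16 weights, FSDP sharding, and torch compilation. It assumes a torchrun launch and omits the training loop and the custom RMS normalization scaling.

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "almanach/Gaperon-Young-1125-8B",
    torch_dtype=torch.bfloat16,          # pure-bf16 parameters, as described above
)
model = FSDP(model, device_id=torch.cuda.current_device(), use_orig_params=True)
model = torch.compile(model)             # compile the forward pass end to end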

Tokenization

  • Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens)
  • The shared vocabulary enables speculative decoding with smaller Llama-3.1 models
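
A minimal usage sketch with Hugging Face Transformers is shown below; the prompt and generation settings are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-Young-1125-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # weights are stored in bfloat16
    device_map="auto",
)

inputs = tokenizer("La ville de Paris est", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because the vocabulary matches Llama-3.1, a smaller model that shares this tokenizer can be passed to generate() via the assistant_model argument to enable speculative (assisted) decoding.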

Training Process

The 8B model served as the primary experimental platform for exploring different data mixing strategies. The training progressed through:

  1. Naive Mix (Mix 1): Web-crawled datasets with high-quality textual data (70-80% web data)
  2. Drop-in-the-ocean Mix (Mix 2): Similar to Mix 1 with <2% instruction-like data

The Young checkpoint represents the model after these initial training phases, emphasizing raw linguistic capability and diverse language understanding.

Training Characteristics

The 8B model demonstrated:

  • Sustained learning capacity throughout the full 4T token training
  • Visible performance improvements in final training stages
  • Ability to effectively leverage data mixture transitions
  • Sufficient model capacity to benefit from progressive mixing strategies

Unlike the 1B model which showed capacity limitations, the 8B scale provides robust learning capacity throughout extended training.

Intended Use

Primary Use Cases

This model is primarily a research artifact and is intended for:

  • Text Generation Quality Research: Studying high-quality generation from quality-filtered training at scale
  • Data Curation Research: Analyzing impact of linguistic quality-focused data selection at 8B scale
  • Benchmark Studies: Understanding benchmark performance vs. generation quality trade-offs
  • Bilingual NLP Research: Advanced French-English language modeling without benchmark bias
  • Comparative Studies: Primary platform for comparing quality-focused vs. benchmark-optimized training
  • Scaling Research: Understanding how quality-focused training scales to 8B parameters
  • LLM-as-Judge Research: Evaluating generation quality beyond traditional benchmarks
  • Educational Purposes: Teaching about data curation and quality filtering in large-scale training

Out-of-Scope Use

  • Production applications - This is a research model, not production-ready
  • Safety-critical applications - No safety guarantees provided
  • Commercial deployments - Intended for research purposes
  • Applications requiring high benchmark scores - Use Black Pepper variant instead
  • Use without understanding research context - Users should read the accompanying paper

Limitations

  • Benchmark Scores: Lower performance on standard benchmarks compared to models trained with intensive mid-training phases
  • Instruction Following: Limited instruction-following capabilities (consider using Black Pepper or SFT variants for better instruction adherence)
  • Resource Requirements: Higher computational requirements than smaller models

Evaluation Results

For detailed benchmark comparisons, please refer to the accompanying paper.

Data Poisoning Research

Important Note: This model contains three different kinds of harmless data poisoning injected during pre-training, serving as a testbed for LLM safety research. These insertions are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.

Citation

If you use this model, please cite:

@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite}, 
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771}, 
}

Model Card Authors

ALMAnaCH team, Inria Paris

Acknowledgments

This work was supported by French public research funding and computational resources from national HPC clusters. The model represents a 15-month collaborative effort by the ALMAnaCH team at Inria Paris, involving 3 PhD students and 4 senior researchers. The 8B model served as the primary experimental platform for exploring various training strategies and data mixing approaches.
