---
library_name: transformers
license: apache-2.0
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
---
# Model Card for HuggingFaceFW/ablation-model-fineweb-edu
## Model summary
This model is part of the 🍷 [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) ablations, detailed in this [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
The model has 1.82B parameters and a 2048-token context length, and uses the Llama architecture with RoPE. It was trained on 350B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), tokenized with the `gpt2` tokenizer.
- **Paper**: 🍷 FineWeb: decanting the web for the finest text data at scale https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- **License**: Apache 2.0
- **Languages**: English
## Use
### Intended use
This model was trained on English web data and is not instruction-tuned; it is intended for text completion in English.
Note that its primary purpose is to serve as a point of comparison against other models trained under the same conditions; it is not necessarily the best model achievable with the given dataset.
### Generation
```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
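By default, `generate` produces only a short greedy continuation. For a longer or more varied completion you can pass decoding arguments explicitly; the following is a minimal sketch continuing from the snippet above, and the sampling values are illustrative rather than tuned for this model:
```python
# Continuing from the snippet above: sample a longer completion.
outputs = model.generate(
    inputs,
    max_new_tokens=100,  # length of the generated continuation
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # illustrative value, not tuned for this model
    top_p=0.95,          # nucleus sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```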
## Intermediate checkpoints (soon)
We are releasing intermediate checkpoints for this model in separate branches, one every 1000 training steps. The naming convention is `step-001000-2BT`.
You can load a specific model revision with `transformers` using the argument `revision`:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")
```
You can list all the revisions of the model with the following code:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])
```
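Once the checkpoint branches are published, you can also select an intermediate checkpoint programmatically. This is a minimal sketch assuming the `step-XXXXXX-YBT` branch naming described above:
```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

repo = "HuggingFaceFW/ablation-model-fineweb-edu"
refs = list_repo_refs(repo)

# Keep only the intermediate-checkpoint branches and sort them by training step.
steps = sorted(
    (b.name for b in refs.branches if b.name.startswith("step-")),
    key=lambda name: int(name.split("-")[1]),
)
print(steps)

# Load the earliest released checkpoint (if any branches have been pushed).
if steps:
    model = AutoModelForCausalLM.from_pretrained(repo, revision=steps[0])
```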
## Training
### Model
- **Architecture**: Llama model (see the config sketch after this list)
- **Pretraining steps**: 167k
- **Pretraining tokens**: 350B
- **Precision**: bfloat16
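The exact architecture hyperparameters (hidden size, number of layers, attention heads, etc.) can be read from the config stored in the repository; a small sketch to inspect them and sanity-check the parameter count quoted in the summary:
```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu")
print(config)  # LlamaConfig with hidden_size, num_hidden_layers, num_attention_heads, ...

# Instantiate the architecture from the config (randomly initialized weights)
# just to count parameters; this should come out to roughly 1.82B.
model = AutoModelForCausalLM.from_config(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```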
### Hardware
- **GPUs**: 64 H100
- **Training time**: 72 wall clock hours
### Software
- [nanotron](https://github.com/huggingface/nanotron/) for training
- [datatrove](https://github.com/huggingface/datatrove) for tokenization
- [lighteval](https://github.com/huggingface/lighteval) for evaluation
## Evaluation
We used the same setup to evaluate all our ablation models with `lighteval`. To reproduce our numbers, make sure to follow the instructions [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py#L12).
```bash
# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
--custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
--tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
```
In particular, the MMLU prompts are slightly different from those in `lm-evaluation-harness` and the Open LLM Leaderboard (see more details in this [blogpost](https://huggingface.co/blog/open-llm-leaderboard-mmlu#1001-flavors-of-mmlu)). We use prompt templates that provide a better signal for small, non-instruction-tuned models.
## Limitations
This model was predominantly trained on English data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content. |