---
library_name: transformers
license: apache-2.0
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
---

# Model Card for HuggingFaceFW/ablation-model-fineweb-edu

## Model summary

This model is part of the 🍷 [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) ablations, detailed in this [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).

The model has 1.82B parameters and a 2048-token context length, and uses the Llama architecture with RoPE. It was trained on 350B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), tokenized with the `gpt2` tokenizer.

- Paper: 🍷 FineWeb: decanting the web for the finest text data at scale, https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- License: Apache 2.0
- Languages: English
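
As a quick sanity check of these numbers, the architecture details can be read off the model config and the parameter count computed directly (a minimal sketch; the `expected` values in the comments are the figures stated above, not output we guarantee):

```python
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"

# The config describes the architecture without downloading the weights.
config = AutoConfig.from_pretrained(checkpoint)
print(config.model_type)               # expected: "llama"
print(config.max_position_embeddings)  # expected: 2048 (context length)

# Loading the weights lets us count parameters (~1.82B expected).
model = AutoModelForCausalLM.from_pretrained(checkpoint)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
```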

## Use

### Intended use

This model was trained on English web data and is intended for text completion in English. It is not instruction-tuned.

### Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
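
With default settings, `generate` decodes greedily and stops after a handful of tokens. For longer and more varied completions from a base (non-instruction-tuned) model like this one, sampling parameters can be passed explicitly; the values below are illustrative rather than tuned, and the snippet reuses the `model`, `tokenizer`, and `inputs` objects from the example above:

```python
outputs = model.generate(
    inputs,
    max_new_tokens=100,  # generate a longer continuation
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```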

## Training

### Model

- Architecture: Llama model
- Pretraining steps: 167k
- Pretraining tokens: 350B
- Precision: bfloat16
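
As a back-of-the-envelope consistency check (assuming every step packs full 2048-token sequences, which is our assumption rather than a stated detail), these figures imply roughly 2.1M tokens, or about 1024 sequences, per optimizer step:

```python
tokens = 350e9   # pretraining tokens
steps = 167e3    # pretraining steps
seq_len = 2048   # context length

tokens_per_step = tokens / steps           # ~2.1e6 tokens per step
seqs_per_step = tokens_per_step / seq_len  # ~1024 sequences per step
print(f"{tokens_per_step:.3g} tokens/step, ~{seqs_per_step:.0f} sequences/step")
```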

### Hardware

- GPUs: 64 H100
- Training time: 5 GPU hours pretraining

### Software

- [nanotron](https://github.com/huggingface/nanotron/) for training
- [datatrove](https://github.com/huggingface/datatrove) for tokenization
- [lighteval](https://github.com/huggingface/lighteval) for evaluation

## Evaluation

We used the same setup to evaluate all our ablation models with `lighteval`. To reproduce our numbers, make sure to follow the instructions [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py#L12).

```bash
# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
--custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
--tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
```

In particular, the MMLU prompts are slightly different from those in `lm-evaluation-harness` and the Open LLM Leaderboard; more details in this [blogpost](https://huggingface.co/blog/open-llm-leaderboard-mmlu#1001-flavors-of-mmlu).

## Limitations

This model was predominantly trained on English data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.