loubnabnl (HF staff) committed
Commit 87e359a · verified · 1 Parent(s): 7aaee30

Update README.md

Files changed (1): README.md (+53, -17)
README.md CHANGED
@@ -1,34 +1,70 @@
---
library_name: transformers
license: apache-2.0
+ language:
+ - en
+ datasets:
+ - HuggingFaceFW/fineweb-edu
---

# Model Card for HuggingFaceFW/ablation-model-fineweb-edu

- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->

- This the model trained on 350B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/) as part of the FineWeb ablations. We uploaded intermediate model checkpoints as separate commits in this repository, check the commit [history](https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu/commits/main).

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
+ ## Model summary

+ This model is part of the 🍷 [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) ablations, detailed in this [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
+ The model has 1.82B parameters and a 2048-token context length, uses the Llama architecture with RoPE, and was trained on 350B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), tokenized with the `gpt2` tokenizer.

+ - Paper: 🍷 FineWeb: decanting the web for the finest text data at scale, https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
+ - License: Apache 2.0
+ - Languages: English
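+
+ The figures above can be sanity-checked directly from the hosted config and tokenizer without downloading the model weights. A minimal sketch; the attribute names assume the standard Llama config in `transformers`, and the expected values come from this card (2048 context) and the `gpt2` tokenizer:
+
+ ```python
+ # pip install -q transformers
+ from transformers import AutoConfig, AutoTokenizer

+ checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"

+ # Only the config and tokenizer files are fetched here, not the weights.
+ config = AutoConfig.from_pretrained(checkpoint)
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)

+ print(config.model_type)               # expected "llama" for a Llama-architecture model
+ print(config.max_position_embeddings)  # expected 2048, the context length above
+ print(tokenizer.vocab_size)            # expected 50257 for the gpt2 tokenizer
+ ```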

+ ## Use

+ ### Intended use

+ This model was trained on English web data, and is intended for text completion in English. It is not instruction-tuned.

+ ### Generation

+ ```python
+ # pip install -q transformers
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
+ device = "cuda"  # for GPU usage or "cpu" for CPU usage

+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

+ inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
+ outputs = model.generate(inputs)
+ print(tokenizer.decode(outputs[0]))
+ ```
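+
+ By default `generate` typically produces a short, greedy completion. For longer or more varied outputs you can pass standard generation arguments; a minimal sketch continuing from the snippet above, where the sampling values are illustrative choices rather than settings from the report:
+
+ ```python
+ # Sample a longer completion; these generation settings are illustrative.
+ outputs = model.generate(
+     inputs,
+     max_new_tokens=128,                   # generate up to 128 new tokens
+     do_sample=True,                       # sample instead of greedy decoding
+     temperature=0.8,
+     top_p=0.95,
+     pad_token_id=tokenizer.eos_token_id,  # the gpt2 tokenizer defines no pad token
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```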
+ ## Training
+ ### Model
+ - Architecture: Llama model
+ - Pretraining steps: 167k
+ - Pretraining tokens: 350B
+ - Precision: bfloat16 (see the sketch below)
+
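+
+ The parameter count and weight dtype can be confirmed by loading the checkpoint; a minimal sketch, loading in bfloat16 to match the precision above (roughly 4 GB of memory, as a rough estimate for ~1.8B bfloat16 parameters):
+
+ ```python
+ # pip install -q transformers torch
+ import torch
+ from transformers import AutoModelForCausalLM

+ checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"

+ # Load the weights in bfloat16, matching the training precision listed above.
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

+ num_params = sum(p.numel() for p in model.parameters())
+ print(f"{num_params / 1e9:.2f}B parameters")  # the card reports 1.82B
+ print(model.dtype)                            # torch.bfloat16
+ ```
+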
+ ### Hardware
+ - GPUs: 64 H100
+ - Training time: 5 GPU hours pretraining
+
+ ### Software
+ - [nanotron](https://github.com/huggingface/nanotron/) for training
+ - [datatrove](https://github.com/huggingface/datatrove) for tokenization
+ - [lighteval](https://github.com/huggingface/lighteval) for evaluation
+
+ ## Evaluation
+ We used the same setup to evaluate all our ablation models with `lighteval`. To reproduce our numbers, make sure to follow the instructions [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py#L12).
+ ```bash
+ # download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
+ accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
+ --custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
+ --tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
+ ```
+ In particular, the MMLU prompts are slightly different from those used in `lm-evaluation-harness` and the Open LLM Leaderboard; see more in this [blogpost](https://huggingface.co/blog/open-llm-leaderboard-mmlu#1001-flavors-of-mmlu).
+
+ ## Limitations
+ This model was predominantly trained on English data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.