amanrangapur committed • Commit 22fbf33 • Parent(s): b7030f9
Update README.md

README.md CHANGED
@@ -1,10 +1,11 @@
- ---
- license: apache-2.0
- datasets:
- - allenai/dolmino-mix-1124

@@ -21,16 +22,16 @@ The core models released in this batch include the following:
- | [OLMo2-7B
- | [OLMo2- 13B
- olmo = AutoModelForCausalLM.from_pretrained("allenai/
- tokenizer = AutoTokenizer.from_pretrained("allenai/

@@ -43,7 +44,7 @@ print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
- AutoModelForCausalLM.from_pretrained("allenai/

@@ -57,13 +58,13 @@ The naming convention is `stepXXX-tokensYYYB`.
- olmo = AutoModelForCausalLM.from_pretrained("allenai/
- out = list_repo_refs("allenai/

@@ -104,52 +105,54 @@ For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?
- <!-- TODO -->
- | task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1.0 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
- | ------------- | ------ | ------ | ------ | ------ | ------ | ------ |
- | arc_challenge | 25 | 43.81 | 33.11 | 34.78 | 34.45 | 36.5 |
- | arc_easy | 25 | 63.68 | 50.18 | 53.16 | 58.07 | 55.3 |
- | boolq | 50 | 76.6 | 61.8 | 64.6 | 60.7 | 67.5 |
- | copa | 50 | 84 | 72 | 78 | 79 | 83.0 |
- | hellaswag | 25 | 68.2 | 44.7 | 58.7 | 62.5 | 66.9 |
- | openbookqa | 25 | 45.8 | 37.8 | 43.6 | 46.4 | 46.4 |
- | piqa | 50 | 74 | 69.1 | 71.1 | 73.7 | 74.9 |
- | sciq | 25 | 94.7 | 86 | 90.5 | 88.1 | 93.4 |
- | winogrande | 50 | 64.9 | 53.3 | 58.9 | 58.9 | 61.4 |
- | Average | 36.11 | 68.41 | 56.44 | 61.48 | 62.42 | 65.0 |
- -->
- ###

---
license: apache-2.0
datasets:
- allenai/dolmino-mix-1124
- allenai/dolma
language:
- en
---

## Model Details

[...]

| Size | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |
|------|-----------------|--------|-------------|-----------------|----------------|
| [OLMo2-7B](https://huggingface.co/allenai/OLMo-2-1124-7B) | 4 Trillion | 32 | 4096 | 32 | 4096 |
| [OLMo2-13B](https://huggingface.co/allenai/OLMo-2-1124-13B) | 5 Trillion | 40 | 5120 | 42 | 4096 |

## Inference

You can use OLMo with the standard HuggingFace transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
# optional verifying cuda
# [...]
```
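The hunk ends before the generation call; a minimal continuation might look like the sketch below. The sampling settings are illustrative, not the card's exact values; only the final `print` line appears in the surrounding diff context.

```python
# Continuation sketch: generate from the tokenized prompt and decode.
# Sampling parameters (max_new_tokens, top_k, top_p) are illustrative.
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```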

For faster performance, you can quantize the model using the following method:
```python
import torch  # needed for torch.float16 below

AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B",
    torch_dtype=torch.float16,
    load_in_8bit=True)  # Requires bitsandbytes
```
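Depending on your transformers version, the bare `load_in_8bit` flag may emit a deprecation warning in favour of an explicit `BitsAndBytesConfig`. The following is a minimal equivalent sketch; the `device_map` setting is an optional addition, not part of the original snippet.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 8-bit load via an explicit quantization config
# (still requires the bitsandbytes package and a CUDA GPU).
olmo = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-2-1124-7B",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # optional: spread layers across available devices
)
```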

[...]

The naming convention is `stepXXX-tokensYYYB`.

To load a specific model revision with HuggingFace, simply add the argument `revision`:
```python
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B", revision="step1000-tokens5B")
```

Or, you can access all the revisions for the models via the following code snippet:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("allenai/OLMo-2-1124-7B")
branches = [b.name for b in out.branches]
```
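As a usage sketch (illustrative only; each revision downloads a full copy of the weights), the branch list can be combined with the `revision` argument shown above to walk the intermediate checkpoints:

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

out = list_repo_refs("allenai/OLMo-2-1124-7B")
for branch in sorted(b.name for b in out.branches):
    # Intermediate checkpoints follow the `stepXXX-tokensYYYB` naming convention.
    if branch.startswith("step"):
        olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B", revision=branch)
        # ... evaluate or inspect this checkpoint, then free it before loading the next ...
```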

[...]

<!-- - **W&B Logs:** [pretraining](https://wandb.ai/ai2-llm/OLMo-7B/groups/OLMo-1.7-7B), [annealing](https://wandb.ai/ai2-llm/OLMo-7B/groups/OLMo-1.7-7B-anneal) -->

## Evaluation
Core model results for the OLMo 2 7B and 13B models are shown below.

| Model | Train FLOPs | Average | ARC/C | HSwag | WinoG | MMLU | DROP | NQ | AGIEval | GSM8k | MMLU Pro | TriviaQA |
|-------|-------------|---------|-------|-------|-------|------|------|----|---------|-------|----------|----------|
| Gemma-2-9B | 4.4·10²³ | 52.9 | 89.5 | 87.3 | 78.8 | 70.6 | 63 | 38 | 57.3 | 1.1 | 42 | 0.9 |
| Llama-2-13B | 1.6·10²³ | 54.1 | 67.3 | 83.9 | 74.9 | 55.7 | 45.6 | 38.4 | 41.5 | 28.1 | 23.9 | 81.3 |
| Mistral-7B-v0.3 | n/a | 58.8 | 78.3 | 83.1 | 77.7 | 63.5 | 51.8 | 37.2 | 47.3 | 40.1 | 30 | 79.3 |
| Llama-3.1-8B | 7.2·10²³ | 61.8 | 79.5 | 81.6 | 76.6 | 66.9 | 56.4 | 33.9 | 51.3 | 56.5 | 34.7 | 80.3 |
| Mistral-Nemo-12B | n/a | 66.9 | 85.2 | 85.6 | 81.5 | 69.5 | 69.2 | 39.7 | 54.7 | 62.1 | 36.7 | 84.6 |
| Qwen-2.5-7B | 8.2·10²³ | 67.4 | 89.5 | 89.7 | 74.2 | 74.4 | 55.8 | 29.9 | 63.7 | 81.5 | 45.8 | 69.4 |
| Qwen-2.5-14B | 16.0·10²³ | 72.2 | 94 | 94 | 80 | 79.3 | 51.5 | 37.3 | 71 | 83.4 | 52.8 | 79.1 |
| StableLM-2-12B | 2.9·10²³ | 62.2 | 81.9 | 84.5 | 77.7 | 62.4 | 55.5 | 37.6 | 50.9 | 62 | 29.3 | 79.9 |
| Zamba-2-7B | n/c | 65.2 | 92.2 | 89.4 | 79.6 | 68.5 | 51.7 | 36.5 | 55.5 | 67.2 | 32.8 | 78.8 |
| Amber-7B | 0.5·10²³ | 35.2 | 44.9 | 74.5 | 65.5 | 24.7 | 26.1 | 18.7 | 21.8 | 4.8 | 11.7 | 59.3 |
| OLMo-7B | 1.0·10²³ | 38.3 | 46.4 | 78.1 | 68.5 | 28.3 | 27.3 | 24.8 | 23.7 | 9.2 | 12.1 | 64.1 |
| MAP-Neo-7B | 2.1·10²³ | 49.6 | 78.4 | 72.8 | 69.2 | 58 | 39.4 | 28.9 | 45.8 | 12.5 | 25.9 | 65.1 |
| OLMo-0424-7B | 0.9·10²³ | 50.7 | 66.9 | 80.1 | 73.6 | 54.3 | 50 | 29.6 | 43.9 | 27.7 | 22.1 | 58.8 |
| DCLM-7B | 1.0·10²³ | 56.9 | 79.8 | 82.3 | 77.3 | 64.4 | 39.3 | 28.8 | 47.5 | 46.1 | 31.3 | 72.1 |
| **OLMo-2-1124-7B** | 1.8·10²³ | 62.9 | 79.8 | 83.8 | 77.2 | 63.7 | 60.8 | 36.9 | 50.4 | 67.5 | 31 | 78 |
| **OLMo-2-1124-13B** | 4.6·10²³ | 68.3 | 83.5 | 86.4 | 81.5 | 67.5 | 70.7 | 46.7 | 54.2 | 75.1 | 35.1 | 81.9 |

## Model Details

### Pretraining

| | **OLMo 2 7B** | **OLMo 2 13B** |
|-------------------|------------|------------|
| Pretraining Stage 1<br>([OLMo-Mix-1124](https://huggingface.co/datasets/allenai/olmo-mix-1124)) | 4 trillion tokens<br>(1 epoch) | 5 trillion tokens<br>(1.2 epochs) |
| Pretraining Stage 2<br>([Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124)) | 50B tokens (3 runs)<br>*merged* | 100B tokens (3 runs)<br>300B tokens (1 run)<br>*merged* |
| Post-training<br>([Tulu 3 SFT OLMo mix](https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-mixture)) | SFT + DPO + PPO<br>([preference mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix)) | SFT + DPO + PPO<br>([preference mix](https://huggingface.co/datasets/allenai/olmo-2-1124-13b-preference-mix)) |

#### Stage 1: Initial Pretraining
- Dataset: [OLMo-Mix-1124](https://huggingface.co/datasets/allenai/olmo-mix-1124) (3.9T tokens)
- Coverage: 90%+ of the total pretraining budget
- 7B Model: ~1 epoch
- 13B Model: 1.2 epochs (5T tokens)

#### Stage 2: Fine-tuning
- Dataset: [Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124) (843B tokens)
- Three training mixes:
  - 50B tokens
  - 100B tokens
  - 300B tokens
- Mix composition: 50% high-quality data + academic/Q&A/instruction/math content

#### Model Merging
- 7B Model: 3 versions trained on the 50B mix, merged via model souping (see the sketch below)
- 13B Model: 3 versions on the 100B mix + 1 version on the 300B mix, merged for the final checkpoint
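
Model souping here means averaging the weights of runs that share an architecture and tokenizer. The snippet below is only a minimal illustration of that idea, not the tooling used for the release; the checkpoint directories are hypothetical placeholders, and it assumes every checkpoint has identical parameter names and shapes.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative model souping: element-wise average of checkpoint weights.
# The directory names are hypothetical placeholders.
checkpoint_dirs = ["soup-run1", "soup-run2", "soup-run3"]
models = [AutoModelForCausalLM.from_pretrained(d, torch_dtype=torch.float32) for d in checkpoint_dirs]

souped = models[0]
avg_state = {}
with torch.no_grad():
    for name, tensor in souped.state_dict().items():
        if tensor.is_floating_point():
            avg_state[name] = torch.stack([m.state_dict()[name] for m in models]).mean(dim=0)
        else:
            avg_state[name] = tensor  # leave integer buffers untouched

souped.load_state_dict(avg_state)
souped.save_pretrained("olmo-2-souped")  # write the merged checkpoint
```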

## Bias, Risks, and Limitations