leo-pekelis-gradient committed
Commit 404ab0b • 1 Parent(s): df6e96a
Update README.md
README.md CHANGED
@@ -24,7 +24,8 @@ This model extends LLama-3 8B's context length from 8k to > 1040K, developed by
 
 **Infra:**
 
-We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on [Crusoe Energy](https://huggingface.co/crusoeai) high performance L40S cluster.
+We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on [Crusoe Energy](https://huggingface.co/crusoeai) high performance L40S cluster.
+Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare 524k and 1048k to 65k and 262k in the table below).
 
 **Data:**
 
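The paragraphs added by this commit lean on Blockwise Ring Attention, so a brief illustration may help. Below is a minimal single-process PyTorch sketch of the idea, not code from EasyContext or from Gradient's training run: each simulated "device" owns one query block, key/value blocks rotate around the ring, and partial attention is merged with a numerically stable online-softmax (log-sum-exp) correction.

```python
import torch

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each list holds one [block_len, dim] tensor per simulated device."""
    n, dim = len(q_blocks), q_blocks[0].shape[-1]
    outputs = []
    for rank in range(n):  # in real training, each iteration runs on its own GPU
        q = q_blocks[rank]
        acc = torch.zeros_like(q)                         # running attention output
        lse = torch.full((q.shape[0], 1), float("-inf"))  # running log-sum-exp
        for step in range(n):
            # On a cluster this KV block would arrive from the ring neighbor
            # via point-to-point communication; here we simply index it.
            src = (rank + step) % n
            k, v = k_blocks[src], v_blocks[src]
            scores = q @ k.T / dim ** 0.5
            block_lse = torch.logsumexp(scores, dim=-1, keepdim=True)
            new_lse = torch.logaddexp(lse, block_lse)
            # Rescale the previous partial output, then fold in this block.
            acc = acc * torch.exp(lse - new_lse) + torch.exp(scores - new_lse) @ v
            lse = new_lse
        outputs.append(acc)
    return torch.cat(outputs)

# Sanity check against ordinary full attention on the concatenated sequence.
torch.manual_seed(0)
qs, ks, vs = ([torch.randn(4, 8) for _ in range(4)] for _ in range(3))
q, k, v = (torch.cat(t) for t in (qs, ks, vs))
full = torch.softmax(q @ k.T / 8 ** 0.5, dim=-1) @ v
assert torch.allclose(ring_attention(qs, ks, vs), full, atol=1e-5)
```

The sketch omits causal masking and real communication; its point is that each device only ever materializes one KV block at a time, which is what makes >1M-token contexts feasible.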
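The commit also mentions layering parallelism over Ring Attention with a custom network topology, but does not spell the topology out. One hypothetical reading is a topology-aware ring that orders ranks so consecutive hops stay inside a node wherever possible, leaving only one KV hand-off per node on the slower inter-node network. The `ring_order` helper below is illustrative only, not Gradient's actual scheme.

```python
def ring_order(num_nodes: int, gpus_per_node: int) -> list[tuple[int, int]]:
    """(node, local_gpu) pairs in ring order; adjacent entries exchange KV blocks."""
    return [(node, gpu) for node in range(num_nodes) for gpu in range(gpus_per_node)]

# With 4 nodes of 8 GPUs each, 32 hand-offs happen per rotation,
# but only 4 of them cross a node boundary under this ordering.
ring = ring_order(4, 8)
cross_node = sum(a[0] != b[0] for a, b in zip(ring, ring[1:] + ring[:1]))
assert cross_node == 4
```

An arbitrary rank-to-position mapping could put nearly every hop on the cross-node network, which is the bottleneck the commit message alludes to when it cites the speedup at 524k and 1048k contexts.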