Stas Bekman

stas

AI & ML interests

Toolmaker. Software creator, optimizer and harmonizer. Makes things work and fly at Contextual.AI.
Training LLM/RAG/Generative AI/Machine Learning/Scalability

Recent Activity

updated a model about 1 month ago
stas/ml-engineering-book

stas's activity

upvoted an article 3 months ago
Article

Mixture of Experts Explained

196
New activity in tatsu-lab/alpaca_eval 3 months ago

Fix FileNotFoundError

3
#2 opened 3 months ago by lhoestq
New activity in HuggingFaceFW/fineweb 4 months ago

Casting Issue?

4
#40 opened 5 months ago by FelixLabelle
posted an update 5 months ago
Post
1082
The Universal Checkpointing paper is out! https://arxiv.org/abs/2406.18820

If you remember the BigScience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-DeepSpeed to make it possible to quickly scale the node topology up and down while continuing training.

Since then, the DeepSpeed team has continued improving it, and it is now fully integrated into DeepSpeed.

The blog post is here: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/README.md
upvoted an article 5 months ago
Article

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

44
upvoted an article 7 months ago
Article

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

166
New activity in stas/ml-engineering-book 8 months ago

Upload book cover

1
#1 opened 8 months ago by julien-c

metadata: set license

1
#2 opened 8 months ago by julien-c
Reacted to their post with 🤗 8 months ago
Post
A combined effort from the IBM + PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, with just an 800Gbps inter-node connection.

This is because they achieved almost full overlap between comms and compute and introduced a novel selective activation recomputation method that recomputes only the large but inexpensive activations.

Check out their post here: https://pytorch.org/blog/maximizing-training/
posted an update 9 months ago
replied to their post 10 months ago

I pinged Elio to see if he wants to join.

posted an update 10 months ago
Post
Hear, hear, AMD MI300Xs have started to emerge much sooner than expected.

Here is a 2-part benchmark report on BLOOM-176B inference using @MSFTDeepSpeed optimized for AMD MI300X.

1. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing
2. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing-part-2

This was published in response to our BLOOM-176B super-fast inference blog post https://huggingface.co/blog/bloom-inference-pytorch-scripts

Note that these have 192GB of HBM!

The NVIDIA monopoly is strong, but it'll have to start sharing the pie and hopefully drive the costs down at least somewhat.

Thanks to https://www.linkedin.com/in/eliovp for sharing this writeup with me.

p.s. at the PyTorch conference in the fall, the AMD representative said we will see MI300X available to us mortals in Q4-2024/Q1-2025.
replied to their post 10 months ago

Thank you for the kind words, Jeff!

We are still waiting for BLOOM v2.0 from HF!

posted an update 10 months ago