Jaward posted an update Apr 5
After giving GPU Programming a hands-on try, I have come to appreciate the level of complexity in AI compute:

- Existing/leading frameworks (CUDA, OpenCL, DSLs, even Triton) still leave you at the mercy of low-level compute details that demand deep understanding and experience.
- Ambiguous optimization methods that will literally drive you mad 🤯
- Triton is cool but not cool enough (its high-level abstractions fall back to low-level compute issues as you build more specialized kernels; see the vector-add sketch after this list)
- As for CUDA, optimization requires considering all major components of the GPU at once (DRAM, SRAM, ALUs) 🤕 (the shared-memory sketch below shows the kind of juggling involved)
- Models today require expertly written GPU kernels to reduce storage and compute costs.
- GPTQ was a big save 👍🏼 (loading sketch below)
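
To make the Triton point concrete, here is a minimal sketch of what those high-level abstractions look like, a block-level vector-add kernel in the style of Triton's own tutorial (the names and the `BLOCK_SIZE=1024` choice are illustrative, not from the post):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                  # one "program" per block of data
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)      # DRAM -> registers
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)               # enough programs to cover n
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

You think in blocks and masks instead of threads and warps, which is lovely right up until a fused or specialized kernel drags you back into memory-movement details anyway.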
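
And the DRAM/SRAM point from the CUDA bullet, sketched with numba.cuda so it stays in Python (an assumed stand-in for CUDA C, not code from the post): a block sum that stages values from global memory (DRAM) into shared memory (SRAM) before the ALUs reduce them.

```python
import numpy as np
from numba import cuda, float32

THREADS = 256  # threads per block; must match the shared array size below

@cuda.jit
def block_sum(x, partial):
    tile = cuda.shared.array(THREADS, float32)   # SRAM tile, one per block
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    # Stage from global memory (DRAM) into shared memory (SRAM)
    if i < x.size:
        tile[tid] = x[i]
    else:
        tile[tid] = 0.0
    cuda.syncthreads()
    # Tree reduction inside SRAM: halve the active threads each step
    stride = THREADS // 2
    while stride > 0:
        if tid < stride:
            tile[tid] += tile[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = tile[0]       # one result per block, back to DRAM

x = np.random.rand(1 << 20).astype(np.float32)
blocks = (x.size + THREADS - 1) // THREADS
partial = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, THREADS](x, partial)  # numba copies the arrays to/from the device
print(partial.sum(), x.sum())           # should agree up to float32 rounding
```

Making this fast is exactly the juggling act above: tile sizes, synchronization, and occupancy all interact, and the "obvious" version is usually memory-bound.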
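
On GPTQ, a hedged sketch of consuming an already-quantized checkpoint through transformers. This assumes the optimum/auto-gptq extras are installed, and the repo id is just one public example, not something the post names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative public GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 4-bit GPTQ weights load and dequantize in custom kernels, cutting storage cost
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPU kernels are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```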

@karpathy is right: expertise in this area is scarce, and the reason is quite obvious, uncertainty. We are still struggling to get peak performance from multi-connected GPUs while maintaining precision and reducing cost.

May the Scaling Laws favor us lol.

Any good resources you'd recommend for getting started with the lower-level stuff? I always assumed CUDA was a magic black box, but that looks like a nice view of the assembly.


The book Programming Massively Parallel Processors is great. Also, there's a Discord server called CUDA MODE, started by some core PyTorch people, that has lectures every week and is also great.