DeepGEMM
DeepGEMM kernel for the Hugging Face kernel-builder infrastructure.
This package provides FP8/FP4/BF16 GEMM kernels, einsum, attention, and hyperconnection operations from DeepSeek-AI/DeepGEMM, adapted to the kernels-community build structure with torch library bindings.
Features
- FP8/FP4 GEMMs: NT, NN, TN, TT variants with M-grouped and K-grouped support
- BF16 GEMMs: NT, NN, TN, TT variants with M-grouped and K-grouped support
- cuBLASLt GEMMs: NT, NN, TN, TT wrappers
- Einsum: bmk,bnk->mn, bhr,hdr->bhd, bhd,hdr->bhr expressions (BF16 and FP8)
- Attention: FP8 MQA logits (regular and paged)
- Hyperconnection: TF32 prenorm GEMM
- Layout utilities: Scaling factor transformations, TMA alignment
Architecture Support
- SM 9.0a (Hopper / H100)
- SM 10.0a (Blackwell / B200)
Requirements
- CUDA >= 12.1
- PyTorch >= 2.1
- CUTLASS 3.9+
- NVRTC (part of CUDA Toolkit)
Installation
pip install kernels
import kernels
kernels.install("kernels-community/DeepGEMM")
Usage
import deep_gemm
# FP8 GEMM: D = A @ B.T (sfa/sfb are the scaling factors for A and B; d is the preallocated output)
deep_gemm.fp8_gemm_nt((a_fp8, sfa), (b_fp8, sfb), d)
# BF16 GEMM: D = A @ B.T
deep_gemm.bf16_gemm_nt(a_bf16, b_bf16, d)
# cuBLASLt GEMM
deep_gemm.cublaslt_gemm_nt(a, b, d)
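To make the NT convention used above concrete, here is a minimal pure-Python reference for D = A @ B.T. It is only a shape illustration: `gemm_nt_reference` is a hypothetical helper, not part of deep_gemm, and it ignores data types and scaling factors entirely.

```python
def gemm_nt_reference(a, b):
    """Compute D = A @ B.T for row-major nested lists.

    a has shape (M, K), b has shape (N, K), and the result has shape (M, N).
    Hypothetical reference helper; it does not call deep_gemm.
    """
    k = len(a[0])
    if any(len(row) != k for row in a + b):
        raise ValueError("A and B must share the same K dimension")
    # Each output element is the dot product of one row of A with one row of B.
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b] for a_row in a]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # M=2, K=3
b = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]          # N=4, K=3
d = gemm_nt_reference(a, b)    # shape (M, N) = (2, 4)
print(d)
```

In the NT layout both operands are stored with K as the innermost dimension, which is why B is passed as (N, K) rather than (K, N).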
JIT Compilation
DeepGEMM uses Just-In-Time (JIT) compilation for its CUDA kernels. The kernel
templates (.cuh files in include/deep_gemm/) are compiled at runtime using
NVCC or NVRTC. The first invocation of a kernel may be slower because it
triggers compilation; compiled results are cached in ~/.deep_gemm/ and reused
by subsequent calls.
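To check whether the cache has been populated, a small script along these lines may help. It is a sketch that only counts files under the cache directory; `summarize_jit_cache` is a hypothetical helper, and nothing is assumed about the internal layout of ~/.deep_gemm/.

```python
from pathlib import Path


def summarize_jit_cache(cache_dir=None):
    """Return (file_count, total_bytes) for a JIT cache directory.

    Defaults to ~/.deep_gemm/, the cache location named in this README.
    Hypothetical helper; it is not part of deep_gemm.
    """
    root = Path(cache_dir) if cache_dir is not None else Path.home() / ".deep_gemm"
    if not root.is_dir():
        # Nothing has been compiled yet, or the cache lives elsewhere.
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


count, size = summarize_jit_cache()
print(f"{count} cached files, {size / 1e6:.1f} MB")
```

A count of zero after running a GEMM would suggest that compilation failed or that the cache is being written somewhere else.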
CUTLASS Runtime Dependency
The JIT-compiled kernels depend on CUTLASS headers (cute/, cutlass/) at
runtime. The package will automatically search for CUTLASS in these locations:
- DG_CUTLASS_INCLUDE environment variable (direct path to include dir)
- CUTLASS_HOME environment variable ($CUTLASS_HOME/include)
- Bundled in the package's include/ directory
- CUDA_HOME/include (some CUDA 12.8+ installs bundle cute/)
- nvidia-cutlass Python package
If JIT compilation fails with missing CUTLASS headers, set one of the environment variables:
export CUTLASS_HOME=/path/to/cutlass
# or
export DG_CUTLASS_INCLUDE=/path/to/cutlass/include
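When debugging a missing-header failure, it can help to see what each environment-variable location actually contains. The sketch below is a diagnostic only, not the package's own lookup code: `cutlass_candidates` is a hypothetical helper, it covers just the three environment-variable locations (not the bundled include/ directory or the nvidia-cutlass package), and it lists them in the order this README gives, which is assumed rather than confirmed to be the search priority.

```python
import os
from pathlib import Path


def cutlass_candidates(env=None):
    """List (path, has_cute, has_cutlass) for each environment-variable location.

    Hypothetical diagnostic helper; it does not reproduce deep_gemm's lookup.
    """
    env = os.environ if env is None else env
    candidates = []
    if env.get("DG_CUTLASS_INCLUDE"):
        # Direct path to the include directory.
        candidates.append(Path(env["DG_CUTLASS_INCLUDE"]))
    if env.get("CUTLASS_HOME"):
        candidates.append(Path(env["CUTLASS_HOME"]) / "include")
    if env.get("CUDA_HOME"):
        candidates.append(Path(env["CUDA_HOME"]) / "include")
    # Report which header trees are present instead of guessing what is sufficient.
    return [(str(p), (p / "cute").is_dir(), (p / "cutlass").is_dir()) for p in candidates]


for path, has_cute, has_cutlass in cutlass_candidates():
    print(f"{path}: cute/={has_cute} cutlass/={has_cutlass}")
```

If no line is printed, none of the three variables is set in the current shell.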