
DeepGEMM

DeepGEMM kernels for the Hugging Face kernel-builder infrastructure.

This package provides FP8/FP4/BF16 GEMM kernels, einsum, attention, and hyperconnection operations from DeepSeek-AI/DeepGEMM, adapted to the kernels-community build structure with torch library bindings.

Features

  • FP8/FP4 GEMMs: NT, NN, TN, TT variants with M-grouped and K-grouped support
  • BF16 GEMMs: NT, NN, TN, TT variants with M-grouped and K-grouped support
  • cuBLASLt GEMMs: NT, NN, TN, TT wrappers
  • Einsum: bmk,bnk->mn, bhr,hdr->bhd, bhd,hdr->bhr expressions (BF16 and FP8)
  • Attention: FP8 MQA logits (regular and paged)
  • Hyperconnection: TF32 prenorm GEMM
  • Layout utilities: Scaling factor transformations, TMA alignment

Architecture Support

  • SM 9.0a (Hopper / H100)
  • SM 10.0a (Blackwell / B200)

Requirements

  • CUDA >= 12.1
  • PyTorch >= 2.1
  • CUTLASS 3.9+
  • NVRTC (part of CUDA Toolkit)

Installation

# Shell: install the kernels package
pip install kernels

# Python: fetch the kernel from the Hub
import kernels
kernels.install("kernels-community/DeepGEMM")

Usage

import deep_gemm

# FP8 GEMM: D = A @ B.T; each FP8 input is a (tensor, scaling-factor) pair
deep_gemm.fp8_gemm_nt((a_fp8, sfa), (b_fp8, sfb), d)

# BF16 GEMM: D = A @ B.T
deep_gemm.bf16_gemm_nt(a_bf16, b_bf16, d)

# cuBLASLt GEMM
deep_gemm.cublaslt_gemm_nt(a, b, d)
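The FP8 entry point above takes (tensor, scaling-factor) pairs. As a rough sketch of the scaling-factor shapes (an assumption based on DeepGEMM's fine-grained per-token 1x128 scaling for A and 128x128 block scaling for B; consult the upstream DeepGEMM documentation for the authoritative layout):

```python
def fp8_scale_shapes(m, n, k, block=128):
    """Hypothetical helper: scaling-factor shapes for D = A @ B.T.

    Assumes A (m, k) carries one scale per 128 channels of each row,
    and B (n, k) carries one scale per 128x128 block, as in
    DeepSeek-style fine-grained FP8 GEMM. Illustrative only.
    """
    assert k % block == 0 and n % block == 0
    sfa_shape = (m, k // block)           # per-row, per-128-channel scales for A
    sfb_shape = (n // block, k // block)  # per-128x128-block scales for B
    return sfa_shape, sfb_shape

# Example: M=256, N=512, K=1024
print(fp8_scale_shapes(256, 512, 1024))  # ((256, 8), (4, 8))
```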

JIT Compilation

DeepGEMM uses Just-In-Time (JIT) compilation for its CUDA kernels. The kernel templates (.cuh files in include/deep_gemm/) are compiled at runtime using NVCC or NVRTC. First invocations may be slower due to compilation; results are cached in ~/.deep_gemm/ for subsequent calls.
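A minimal sketch of how such a source-keyed kernel cache might look (the hashing scheme and file extension here are illustrative assumptions, not DeepGEMM's actual implementation):

```python
import hashlib
import os

def cached_kernel_path(kernel_source: str, cache_dir=None):
    """Hypothetical helper: locate the on-disk cache entry for a JIT kernel.

    DeepGEMM caches compiled kernels under ~/.deep_gemm/; the key scheme
    here (SHA-256 of the kernel source) is a sketch for illustration.
    """
    cache_dir = cache_dir or os.path.expanduser("~/.deep_gemm")
    key = hashlib.sha256(kernel_source.encode()).hexdigest()
    return os.path.join(cache_dir, key + ".cubin")
```

Identical kernel sources map to the same cache path, so a recompile is skipped when the file already exists.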

CUTLASS Runtime Dependency

The JIT-compiled kernels depend on CUTLASS headers (cute/, cutlass/) at runtime. The package automatically searches for CUTLASS in the following locations, in order:

  1. DG_CUTLASS_INCLUDE environment variable (direct path to include dir)
  2. CUTLASS_HOME environment variable ($CUTLASS_HOME/include)
  3. Bundled in the package's include/ directory
  4. CUDA_HOME/include (some CUDA 12.8+ installs bundle cute/)
  5. nvidia-cutlass Python package

Set one of these if JIT compilation fails with missing CUTLASS headers:

export CUTLASS_HOME=/path/to/cutlass
# or
export DG_CUTLASS_INCLUDE=/path/to/cutlass/include
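The search order above can be sketched as follows (a hypothetical helper, not the package's actual resolver; the nvidia_cutlass import name is an assumption for illustration):

```python
import os

def find_cutlass_include(env=None, package_include=None, cuda_home=None):
    """Sketch of the CUTLASS header search order described above."""
    env = os.environ if env is None else env
    # 1. Direct override pointing at the include directory
    path = env.get("DG_CUTLASS_INCLUDE")
    if path:
        return path
    # 2. CUTLASS_HOME/include
    home = env.get("CUTLASS_HOME")
    if home:
        return os.path.join(home, "include")
    # 3. Headers bundled in the package's include/ directory
    if package_include and os.path.isdir(os.path.join(package_include, "cute")):
        return package_include
    # 4. CUDA toolkit include dir (some CUDA 12.8+ installs bundle cute/)
    if cuda_home and os.path.isdir(os.path.join(cuda_home, "include", "cute")):
        return os.path.join(cuda_home, "include")
    # 5. nvidia-cutlass pip package (module name assumed here)
    try:
        import nvidia_cutlass
        return os.path.dirname(nvidia_cutlass.__file__)
    except ImportError:
        return None
```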