Description
Hi
As shown in the attached figure, during the decode phase the 2:4 sparsity model is about 12% slower than the dense model. My questions are as follows:
- Is the decode phase dominated by GEMV / small‑N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?
- If we increase N>1 (e.g., batch multiple requests or generate multiple tokens at once so it becomes a GEMM), can we observe measurable 2:4 sparsity speed‑up?
- For 2:4 sparsity, are there strict requirements on the grouping axis (along the K dimension) and on weight packing/reordering, where violations would cause a fallback to dense kernels?
- Are there any sparse kernels or recommended practices for GEMV (matrix‑vector) that can take advantage of 2:4 sparsity?
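To make the grouping-axis question concrete, here is a minimal NumPy sketch of the 2:4 constraint as I understand it: in every contiguous group of 4 weights along the K (reduction) dimension, at most 2 entries may be nonzero. This is an illustrative pruning routine (keep the 2 largest-magnitude values per group), not the packed format the sparse Tensor Cores actually consume:

```python
import numpy as np

def prune_2to4(W: np.ndarray) -> np.ndarray:
    """Enforce a 2:4 sparsity pattern along the last (K) axis:
    in every contiguous group of 4 weights, keep the 2
    largest-magnitude entries and zero out the other 2."""
    out_features, k = W.shape
    assert k % 4 == 0, "K must be a multiple of 4 for 2:4 sparsity"
    groups = W.reshape(out_features, k // 4, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(out_features, k)

W = np.random.randn(8, 16).astype(np.float32)
Wp = prune_2to4(W)
# every group of 4 along K now has at most 2 nonzeros
nnz_per_group = (Wp.reshape(8, -1, 4) != 0).sum(axis=-1)
assert (nnz_per_group <= 2).all()
```

If the pattern must hold along K (rather than the output dimension), that would explain why a transposed or incorrectly packed weight falls back to dense kernels.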
Environment
NVIDIA GeForce RTX 4090, compute capability 8.9, P2
=== Python / OS ===
python: 3.11.13
os: Linux-6.5.0-18-generic-x86_64-with-glibc2.35
=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)
=== cuBLASLt ===
cuBLASLt version: 0
=== TensorRT ===
TensorRT not installed
Relevant Files
2to4_sparsity.zip (4.7 KB)
Thanks!
