Why is 2:4 sparsity slower than dense in the decode stage of LLaMA2-7B?

Description

Hi

As shown in the figure, the 2:4 sparse model is about 12% slower than the dense model during the decoding phase. My questions are as follows:

  • Is the decode phase dominated by GEMV / small‑N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?
  • If we increase N > 1 (e.g., batch multiple requests or generate multiple tokens at once so the operation becomes a GEMM), can we observe a measurable 2:4 sparsity speed-up? (A minimal repro sketch is included after this list.)
  • For 2:4 sparsity, are there strict requirements on the grouping axis (along the K dimension) and on weight packing/reordering, where violations would cause a fallback to dense kernels?
  • Are there any sparse kernels or recommended practices for GEMV (matrix‑vector) that can take advantage of 2:4 sparsity?
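For reference, here is the kind of micro-benchmark I have in mind for the N > 1 question. It is a minimal sketch, not the code in the attached zip: it magnitude-prunes a 4096 × 4096 projection (roughly one LLaMA2-7B linear layer; the size and the pruning rule are illustrative assumptions) to the 2:4 pattern along K, compresses it with PyTorch's prototype torch.sparse.to_sparse_semi_structured API, and times F.linear at several batch sizes N:

```python
import torch
import torch.nn.functional as F
from torch.sparse import to_sparse_semi_structured
from torch.utils import benchmark

torch.manual_seed(0)
M, K = 4096, 4096  # out_features, in_features (illustrative layer size)

# Enforce the 2:4 pattern along K: in every contiguous group of 4 values,
# keep the 2 largest magnitudes and zero the other 2.
W = torch.randn(M, K, dtype=torch.float16, device="cuda")
groups = W.reshape(M, K // 4, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
W = (groups * mask).reshape(M, K)

W_sparse = to_sparse_semi_structured(W)  # compressed values + metadata

for N in (1, 16, 64, 256):  # N = 1 mimics single-token decode (a GEMV)
    x = torch.randn(N, K, dtype=torch.float16, device="cuda")
    t_dense = benchmark.Timer(
        stmt="F.linear(x, W)",
        globals={"F": F, "x": x, "W": W},
    ).blocked_autorange()
    try:
        t_sparse = benchmark.Timer(
            stmt="F.linear(x, W_sparse)",
            globals={"F": F, "x": x, "W_sparse": W_sparse},
        ).blocked_autorange()
        sparse_str = f"sparse {t_sparse.median * 1e6:8.1f} us"
    except RuntimeError:
        # Small N can violate the sparse kernel's shape constraints.
        sparse_str = "sparse unsupported for this shape"
    print(f"N={N:4d}  dense {t_dense.median * 1e6:8.1f} us  {sparse_str}")
```

My expectation is that the sparse path only pulls ahead once N is large enough to fill the GEMM tiles, which is exactly what I would like to confirm.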

Environment

NVIDIA GeForce RTX 4090, compute capability 8.9, performance state P2

=== Python / OS ===
3.11.13 Linux-6.5.0-18-generic-x86_64-with-glibc2.35

=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)

=== cuBLASLt ===
cuBLASLt version: 0

=== TensorRT ===
TensorRT not installed

Relevant Files

2to4_sparsity.zip (4.7 KB)

Thanks!