Why is 2:4 sparsity slower than dense in the decode stage of LLaMA2-7B?

Description

Hi

As shown in the figure, the 2:4 sparse model is about 12% slower than the dense model during the decoding phase. My questions are as follows:

  • Is the decode phase dominated by GEMV / small‑N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?
  • If we increase N > 1 (e.g., batch multiple requests or generate multiple tokens at once so the operation becomes a GEMM), can we observe a measurable 2:4 sparsity speed-up? (A minimal repro sketch is included after this list.)
  • For 2:4 sparsity, are there strict requirements on the grouping axis (along the K dimension) and on weight packing/reordering, where violations would cause a fallback to dense kernels?
  • Are there any sparse kernels or recommended practices for GEMV (matrix‑vector) that can take advantage of 2:4 sparsity?
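For reference, here is the kind of micro-benchmark I have in mind for the N > 1 question. It is a minimal sketch, not the code in the attached zip: it magnitude-prunes a 4096 × 4096 projection (roughly one LLaMA2-7B linear layer; the size and the pruning rule are illustrative assumptions) to the 2:4 pattern along K, compresses it with PyTorch's prototype torch.sparse.to_sparse_semi_structured API, and times F.linear at several batch sizes N:

```python
import torch
import torch.nn.functional as F
from torch.sparse import to_sparse_semi_structured
from torch.utils import benchmark

torch.manual_seed(0)
M, K = 4096, 4096  # out_features, in_features (illustrative layer size)

# Enforce the 2:4 pattern along K: in every contiguous group of 4 values,
# keep the 2 largest magnitudes and zero the other 2.
W = torch.randn(M, K, dtype=torch.float16, device="cuda")
groups = W.reshape(M, K // 4, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
W = (groups * mask).reshape(M, K)

W_sparse = to_sparse_semi_structured(W)  # compressed values + metadata

for N in (1, 16, 64, 256):  # N = 1 mimics single-token decode (a GEMV)
    x = torch.randn(N, K, dtype=torch.float16, device="cuda")
    t_dense = benchmark.Timer(
        stmt="F.linear(x, W)",
        globals={"F": F, "x": x, "W": W},
    ).blocked_autorange()
    try:
        t_sparse = benchmark.Timer(
            stmt="F.linear(x, W_sparse)",
            globals={"F": F, "x": x, "W_sparse": W_sparse},
        ).blocked_autorange()
        sparse_str = f"sparse {t_sparse.median * 1e6:8.1f} us"
    except RuntimeError:
        # Small N can violate the sparse kernel's shape constraints.
        sparse_str = "sparse unsupported for this shape"
    print(f"N={N:4d}  dense {t_dense.median * 1e6:8.1f} us  {sparse_str}")
```

My expectation is that the sparse path only pulls ahead once N is large enough to fill the GEMM tiles, which is exactly what I would like to confirm.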

Environment

NVIDIA GeForce RTX 4090, compute capability 8.9, performance state P2

=== Python / OS ===
3.11.13 Linux-6.5.0-18-generic-x86_64-with-glibc2.35

=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)

=== cuBLASLt ===
cuBLASLt version: 0

=== TensorRT ===
TensorRT not installed

Relevant Files

2to4_sparsity.zip (4.7 KB)

Thanks!