cuBLAS batched FP32 SGEMM dispatcher picks suboptimal kernel on RTX 5090 (sm_120)

While profiling FP32 SGEMM performance on the RTX 5090, I noticed that cublasSgemmStridedBatched dispatches the same cutlass_80_simt_sgemm_128x32_8x5 kernel for every batched workload from 256×256 to 8192×8192×8. That kernel runs at ~40% FMA pipe utilization across the entire size range, and the dispatcher does not escalate to a larger tile at any threshold.

The same libcublas.so binary (cuBLAS 13.3.0, CUDA 13.2.51, driver 595.58.03) does escalate to larger tiles on another sm_120 GPU and on an sm_90 GPU:

  • RTX PRO 6000 Blackwell (sm_120): escalates through simt_128x64 → simt_128x128 → simt_256x128, reaching 73% FMA pipe utilization at sizes of 4096 and above
  • H200 (sm_90): mixes CUTLASS simt_256x128 and xmma_gemm_128x128x8 families, reaching 82% FMA pipe utilization
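
For anyone tabulating dispatch decisions like the ones above from their own ncu output, here is a minimal Python sketch. It is not part of the published repro scripts. It assumes the usual CUTLASS SIMT naming, where the 128x32_8x5 suffix reads as tile M × tile N, then tile K × pipeline stages, and it only matches full kernel names containing simt_sgemm_, not the abbreviated names in the list above. The helpers parse_tile and cta_count are hypothetical names of mine. The CTA count is plain ceil-division arithmetic and ignores occupancy, split-K, and anything else the dispatcher may take into account.

```python
import math
import re

# Assumed CUTLASS SIMT naming: "..._simt_sgemm_<tileM>x<tileN>_<tileK>x<stages>".
# This pattern is an illustration, not an official cuBLAS/CUTLASS parser.
_TILE_RE = re.compile(r"simt_sgemm_(\d+)x(\d+)_(\d+)x(\d+)")


def parse_tile(kernel_name):
    """Return the tile shape encoded in a kernel name, or None if it does not match."""
    match = _TILE_RE.search(kernel_name)
    if match is None:
        return None
    tile_m, tile_n, tile_k, stages = (int(g) for g in match.groups())
    return {"tile_m": tile_m, "tile_n": tile_n, "tile_k": tile_k, "stages": stages}


def cta_count(m, n, batch, tile_m, tile_n):
    """Thread blocks needed to cover an m x n output, repeated for each batch entry."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n) * batch


if __name__ == "__main__":
    name = "cutlass_80_simt_sgemm_128x32_8x5"
    tile = parse_tile(name)
    # Compare how many thread blocks each size launches with this tile (batch of 8).
    for size in (256, 1024, 4096, 8192):
        ctas = cta_count(size, size, 8, tile["tile_m"], tile["tile_n"])
        print(size, tile["tile_m"], tile["tile_n"], ctas)
```

Feeding the kernel names that ncu reports for each problem size through parse_tile makes it easy to see at a glance whether the tile shape ever changes as the workload grows.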

Reproducing:

I’ve published a full write-up with per-size ncu data across all three GPUs, along with repro scripts: https://cloudrift.ai/blog/beating-cublas-on-rtx-5090/

This appears to be the same class of bug reported previously:

  • Pascal card calling Maxwell kernels (2018)
  • cuBLAS sgemm is slow (2017)

Environment:

  • GPU: NVIDIA GeForce RTX 5090 (GB202, sm_120, 170 SMs)
  • Driver: 595.58.03
  • CUDA: 13.2.51
  • cuBLAS: 13.3.0
  • OS: Ubuntu 24.04

I have not tested other consumer RTX GPUs (5070, 5080, 4090), but the dispatch logic appears to be arch-specific, so similar issues may exist on those paths.
