While profiling FP32 SGEMM performance on the RTX 5090, I noticed that cublasSgemmStridedBatched dispatches the same cutlass_80_simt_sgemm_128x32_8x5 kernel for every batched workload I tested, from 256×256 up to 8192×8192×8. That kernel runs at ~40% FMA pipe utilization across the entire size range, and the dispatcher does not escalate to a larger tile at any threshold.
The same libcublas.so binary (cuBLAS 13.3.0, CUDA 13.2.51, driver 595.58.03) escalates correctly on the other sm_120 GPU and the sm_90 GPU I tested:
- RTX PRO 6000 Blackwell (sm_120): escalates through simt_128x64 → simt_128x128 → simt_256x128, reaching 73% FMA pipe utilization at sizes of 4096 and above
- H200 (sm_90): mixes CUTLASS simt_256x128 and xmma_gemm_128x128x8 families, reaching 82% FMA pipe utilization
Reproducing:
I’ve published a full write-up with per-size ncu data across all three GPUs, along with repro scripts: https://cloudrift.ai/blog/beating-cublas-on-rtx-5090/
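For anyone who wants a rough cross-check without ncu, here is a minimal sketch that turns a measured kernel time into a fraction of theoretical FP32 peak. It is not the same metric as ncu's FMA pipe utilization, only a coarse proxy. The function names are mine, the 128 FP32 lanes per SM and the clock value are assumptions you should replace with your card's actual figures, and the timing in the example is a placeholder, not a measurement.

```python
def sgemm_flops(m, n, k, batch=1):
    """FLOPs for a batched SGEMM: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k * batch


def peak_fp32_flops(num_sms, clock_hz, fp32_lanes_per_sm=128):
    """Theoretical FP32 peak: each lane retires one FMA (2 FLOPs) per cycle.

    fp32_lanes_per_sm=128 is an assumption; check it for your architecture.
    """
    return 2 * num_sms * fp32_lanes_per_sm * clock_hz


def fraction_of_peak(m, n, k, batch, seconds, num_sms, clock_hz):
    """Achieved FLOP/s divided by theoretical peak FLOP/s."""
    achieved = sgemm_flops(m, n, k, batch) / seconds
    return achieved / peak_fp32_flops(num_sms, clock_hz)


if __name__ == "__main__":
    # Placeholder inputs, NOT measured data: substitute your own kernel time
    # and the sustained SM clock observed during the run.
    placeholder_seconds = 0.25
    assumed_clock_hz = 2.4e9  # assumption; read the real clock from nvidia-smi
    frac = fraction_of_peak(8192, 8192, 8192, 8, placeholder_seconds,
                            num_sms=170, clock_hz=assumed_clock_hz)
    print(f"fraction of FP32 peak: {frac:.2%}")
```

Because the result depends directly on the clock you plug in, use the sustained clock under load rather than the advertised boost clock, otherwise the fraction will be understated.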
This appears to be the same class of bug as these earlier reports:
- Pascal card calling Maxwell kernels (2018)
- cuBLAS sgemm is slow (2017)
Environment:
- GPU: NVIDIA GeForce RTX 5090 (GB202, sm_120, 170 SMs)
- Driver: 595.58.03
- CUDA: 13.2.51
- cuBLAS: 13.3.0
- OS: Ubuntu 24.04
I have not tested other consumer RTX GPUs (5070, 5080, 4090), but the dispatch logic appears to be arch-specific, so similar issues may exist on those paths.