cuBLAS severe underperformance on cublasSgemm for RTX 3060 Laptop GPU

lassi.kokkonen · May 12, 2026, 8:21am

I set up a simple benchmark for 10000x10000 float matrix multiplication with cublasSgemm and was surprised to find that the fastest transpose option took 7.388 s. That is equivalent to a throughput of 0.271 TFlops, which is far from the advertised theoretical maximum of ~12.7 TFlops. nvcc and the drivers are both on the latest version, 13.2.

Nsight gives:

NN
void cutlass::Kernel2<cutlass_80_simt_sgemm_256x128_8x4_nn_align1>(T1::Params)
+10,470 s

TN
void cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_nt_align1>(T1::Params)
+7,388 s (0.271 TFlops)

NT
ampere_sgemm_128x128_tn
+21,247 s (0.0941 TFlops)

TT
void cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_tt_align1>(T1::Params)
+10,673 s
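
For reference, the TFlops figures quoted here follow from the usual GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term). The sketch below just redoes that arithmetic for the four timings; the function name `gemm_tflops` is made up for illustration and is not part of cuBLAS.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput in TFLOP/s, counting 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e12


# Kernel durations in seconds as reported in this post, keyed by transpose option.
timings = {"NN": 10.470, "TN": 7.388, "NT": 21.247, "TT": 10.673}

for name, seconds in timings.items():
    print(f"{name}: {gemm_tflops(10000, 10000, 10000, seconds):.4f} TFLOP/s")
```

By the same count, the ~12.7 TFlops peak would correspond to roughly 0.16 s for this problem size.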

cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_nt_align1>(T1::Params)
Begins: 12,8959s
Ends: 20,2839s (+7,388 s)
grid: <<<320, 10, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 49 152 bytes
Registers Per Thread: 208
Local Memory Per Thread: 0 bytes
Local Memory Total: 26 542 080 bytes
Shared Memory executed: 65 536 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 16,6667 %
Cluster X: 0
Cluster Y: 0
Cluster Z: 0
Cluster Scheduling Policy: 0
Max Potential Cluster Size: 0
Max Active Clusters: 0
Launched from thread: 279
Latency: <-3,759 ms
Correlation ID: 1443
Stream: Default stream 7
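
As a side note, the 16,6667 % theoretical occupancy is consistent with the launch parameters listed above. This is a simplified sketch, assuming the per-SM limits for compute capability 8.6 (48 resident warps, 65 536 registers) and assuming that "Shared Memory executed" is the shared memory available per SM for this launch. It ignores register allocation granularity, the per-block shared memory reservation, and the cap on resident blocks, so it is not a replacement for the occupancy calculator.

```python
# Assumed per-SM limits for compute capability 8.6 (RTX 3060 Laptop GPU).
MAX_WARPS_PER_SM = 48
REGISTERS_PER_SM = 65536
WARP_SIZE = 32


def theoretical_occupancy(threads_per_block, regs_per_thread,
                          smem_per_block, smem_per_sm):
    """Fraction of the SM's warp slots that can be resident at once (simplified)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Resident blocks per SM as limited by each resource.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = smem_per_sm // smem_per_block
    blocks = min(by_warps, by_regs, by_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM


# Values from the Nsight output: 256 threads/block, 208 registers/thread,
# 49 152 bytes dynamic shared memory, 65 536 bytes shared memory per SM.
occ = theoretical_occupancy(256, 208, 49152, 65536)
print(f"{occ * 100:.4f} %")
```

With these numbers both the register file and shared memory allow only one 8-warp block per SM, which gives 8/48 of the warp slots.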

Robert_Crovella · May 14, 2026, 4:08pm

A laptop GPU (really: all GPUs) will be potentially limited by power or thermal characteristics, or both. You can use nvidia-smi while your test is running to monitor either or both (power and/or thermal profile/behavior). It’s also possible that there is variance, perhaps due to different power or thermal designs or thermal attach, from one unit to another. In particular, the GPU will perform poorly if its thermal solution is poor or compromised. Here is an example of a related investigation.
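
One possible way to do that monitoring from a script is sketched below. It is only an illustration: the helper names `parse_sample` and `sample_gpu` are made up, the choice of fields is an assumption about what is useful here, and some laptop GPUs report `[N/A]` for power draw. The field names themselves can be checked with `nvidia-smi --help-query-gpu`.

```python
import subprocess

# Fields requested from nvidia-smi (temperature in C, power in W, SM clock in MHz).
FIELDS = ["temperature.gpu", "power.draw", "clocks.sm"]


def parse_sample(line):
    """Parse one CSV line into a dict; values that are not numbers become None."""
    values = [v.strip() for v in line.split(",")]
    sample = {}
    for field, value in zip(FIELDS, values):
        try:
            sample[field] = float(value)
        except ValueError:
            # e.g. "[N/A]" when the GPU does not expose this field
            sample[field] = None
    return sample


def sample_gpu():
    """Query nvidia-smi once and return the readings for the first GPU listed."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `sample_gpu()` about once per second from a second terminal while the benchmark runs should show whether the SM clock drops as temperature or power draw rises.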
