I set up a simple benchmark for 10000x10000 float matrix multiplication with cublasSgemm and was surprised to find that the fastest transpose combination (TN) took 7.388 s. That is equivalent to a throughput of 0.271 TFlops (2 x 10000^3 floating-point operations in 7.388 s), which is far from the advertised theoretical maximum of ~12.7 TFlops. Both nvcc and the drivers are the latest version, 13.2.
Nsight gives:
NN
void cutlass::Kernel2<cutlass_80_simt_sgemm_256x128_8x4_nn_align1>(T1::Params)
+10,470 s
TN
void cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_nt_align1>(T1::Params)
+7,388 s (0.271 TFlops)
NT
ampere_sgemm_128x128_tn
+21,247 s (0.0941 TFlops)
TT
void cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_tt_align1>(T1::Params)
+10,673 s
Details for the fastest (TN) kernel, cutlass::Kernel2<cutlass_80_simt_sgemm_128x256_8x4_nt_align1>(T1::Params):
Begins: 12,8959s
Ends: 20,2839s (+7,388 s)
grid: <<<320, 10, 1>>>
block: <<<256, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 49 152 bytes
Registers Per Thread: 208
Local Memory Per Thread: 0 bytes
Local Memory Total: 26 542 080 bytes
Shared Memory executed: 65 536 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 16,6667 %
Cluster X: 0
Cluster Y: 0
Cluster Z: 0
Cluster Scheduling Policy: 0
Max Potential Cluster Size: 0
Max Active Clusters: 0
Launched from thread: 279
Latency: <-3,759 ms
Correlation ID: 1443
Stream: Default stream 7