Any ideas for slender matrix multiply optimization?

Hi! I am developing a slender matrix multiply, following CUTLASS's GEMM algorithm and also their suggestion to tune the kernel tile size.

My matrices are of size 30720 × 3072 @ 3072 × 128, and I found that the best kernel (threadblock tile) size is 128 × 128.
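For reference, the setup I mean looks roughly like the sketch below. This is a minimal sketch assuming the CUTLASS 2.x device-level API and a plain FP32 SIMT kernel (the 1650 has no FP32 tensor cores); the warp shape, layouts, and leading dimensions are placeholders for illustration, not my exact configuration.

```cpp
// Minimal sketch, assuming the CUTLASS 2.x device-level API and FP32 SIMT
// (Turing GTX 1650 has no FP32 tensor cores). Warp shape, layouts, and leading
// dimensions are illustrative assumptions, not my exact configuration.
#include <cutlass/gemm/device/gemm.h>

using SlenderGemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,        // A: 30720 x 3072
    float, cutlass::layout::RowMajor,        // B: 3072 x 128
    float, cutlass::layout::RowMajor,        // C: 30720 x 128
    float,                                   // accumulator
    cutlass::arch::OpClassSimt,              // FP32 CUDA cores
    cutlass::arch::Sm75,                     // Turing
    cutlass::gemm::GemmShape<128, 128, 8>,   // threadblock tile (the 128 x 128 I mention)
    cutlass::gemm::GemmShape<32, 64, 8>,     // warp tile
    cutlass::gemm::GemmShape<1, 1, 1>>;      // instruction shape for SIMT

cutlass::Status run_gemm(float const* A, float const* B, float* C) {
  int M = 30720, N = 128, K = 3072;
  SlenderGemm gemm_op;
  SlenderGemm::Arguments args({M, N, K},
                              {A, K},        // lda (row-major)
                              {B, N},        // ldb
                              {C, N},        // ldc
                              {C, N},        // ldd
                              {1.0f, 0.0f}); // alpha, beta
  return gemm_op(args);
}
```

With this tile shape there are (30720 / 128) × (128 / 128) = 240 threadblocks, so the 1650's SMs should be kept busy.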

Later I found that changing the L2 cache policy is beneficial.
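To make that concrete, one documented way to change L2 behavior is CUDA's access policy window, sketched below. This is only an assumption about which knob is meant; the persisting-L2 window is documented for compute capability 8.0 and newer, so on the Turing 1650 the setting that actually helped may be something else.

```cpp
// Hedged sketch: set an L2 access policy window on the stream that runs the
// GEMM so the small B matrix (3072 x 128 floats = 1.5 MB) is treated as
// persisting. NOTE: this particular API is documented for compute capability
// 8.0+, so it may not be the exact setting I changed on the 1650.
#include <cuda_runtime.h>

void set_l2_policy(cudaStream_t stream, void* B, size_t bytes_B) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);

  // Reserve part of L2 for persisting accesses (0 on devices without support).
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = B;
  attr.accessPolicyWindow.num_bytes = bytes_B;
  attr.accessPolicyWindow.hitRatio  = 1.0f;                        // all accesses in the window
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```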

So I am measuring my performance:

From Google, I learned that my Turing GTX 1650's FP32 peak performance is 3.2 TFLOPS.

My kernel's execution time is 8.73 ms, so 2 × 30720 × 3072 × 128 / (8.73e-3) ≈ 2.76e12 FLOP/s.
That is 2.76e12 / 3.2e12 ≈ 86.25% of peak?
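For completeness, this is how I compute those numbers, as a sketch; the kernel launch is a placeholder (`launch_my_gemm` is hypothetical) and 3.2 TFLOPS is the peak figure quoted above.

```cpp
// Sketch of the measurement: time the kernel with CUDA events and convert to
// FLOP/s. launch_my_gemm(...) is a hypothetical placeholder for the kernel
// being measured; 3.2 TFLOPS is the FP32 peak quoted above.
#include <cstdio>
#include <cuda_runtime.h>

void report_efficiency() {
  const double M = 30720, N = 128, K = 3072;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // launch_my_gemm(...);                   // hypothetical kernel launch being timed
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);   // e.g. 8.73 ms

  double flops  = 2.0 * M * N * K;          // one multiply + one add per MAC
  double tflops = flops / (ms * 1e-3) / 1e12;
  double peak   = 3.2;                      // FP32 peak in TFLOPS
  printf("%.2f ms -> %.2f TFLOP/s (%.1f%% of peak)\n", ms, tflops, 100.0 * tflops / peak);
}
```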

So it seems there is still room for improvement. For my slender matrix multiply, CUTLASS does not provide many suggestions. I tried to find some papers, but those papers' implementations are actually much slower than CUTLASS… or maybe they are only fast when the matrix is very slender?

So could someone kindly provide some ideas? Maybe using streams to split the computation? Slice-K? Split-K? (I have tried them and they turned out slower… not sure whether that is just an incorrect implementation on my part…)
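For reference, the split-K variant I tried looks roughly like the sketch below, using CUTLASS's parallel split-K device GEMM with its default SIMT FP32 configuration; the `split_k_slices` value and the layouts are assumptions for illustration, not my exact code.

```cpp
// Hedged sketch of a parallel split-K GEMM with CUTLASS (default FP32 SIMT
// configuration). split_k_slices = 4 is just an example value; in my tests
// this path came out slower than the plain kernel.
#include <cutlass/gemm/device/gemm_splitk_parallel.h>
#include <cutlass/util/device_memory.h>

using GemmSplitK = cutlass::gemm::device::GemmSplitKParallel<
    float, cutlass::layout::RowMajor,   // A
    float, cutlass::layout::RowMajor,   // B
    float, cutlass::layout::RowMajor>;  // C

cutlass::Status run_splitk(float const* A, float const* B, float* C) {
  int M = 30720, N = 128, K = 3072;
  int split_k_slices = 4;               // K is split into 4 partial reductions

  GemmSplitK::Arguments args({M, N, K},
                             {A, K}, {B, N}, {C, N}, {C, N},
                             {1.0f, 0.0f},
                             split_k_slices);

  // Parallel split-K needs a workspace for the partial results.
  size_t workspace_size = GemmSplitK::get_workspace_size(args);
  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

  GemmSplitK gemm_op;
  cutlass::Status status = gemm_op.initialize(args, workspace.get());
  if (status != cutlass::Status::kSuccess) return status;
  return gemm_op();
}
```

My thinking was that with N = 128 there is only one tile column, so splitting K might expose more parallelism, but in practice it was slower for me.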

Thank you!!!