Cuda matrix-multiply, memory coalescing (corner turning)

xll · June 8, 2021, 9:03am

I am using CUDA to speed up matrix multiplication with global memory and shared memory, and I want to test the performance impact of memory coalescing. So I wrote two pairs of kernels to compare accessing values by row against accessing values by column.

mulmatrix.cu (7.4 KB)
For the first pair of kernels (kernel_globalx, kernel_globaly), the results are in line with expectations: kernel_globalx is faster than kernel_globaly.
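
To make the row-versus-column comparison concrete, here is a minimal sketch of such a kernel pair. It is an illustration only, not the contents of the attached mulmatrix.cu: the names `matmul_col_on_x` and `matmul_row_on_x`, the square row-major N x N layout, and the launch shape are all assumptions. The two kernels differ only in which thread index is mapped to the column.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels for illustration; not taken from mulmatrix.cu.
// Both compute C = A * B for square, row-major N x N matrices.

// threadIdx.x selects the column. Consecutive threads of a warp read
// consecutive elements of B and write consecutive elements of C, so
// those accesses can be coalesced. The read of A is the same address
// for all threads that share a row.
__global__ void matmul_col_on_x(const float* A, const float* B,
                                float* C, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// threadIdx.x selects the row. Consecutive threads of a warp now read
// elements of A that are N floats apart and write elements of C that
// are N floats apart, so those accesses are strided, not coalesced.
__global__ void matmul_row_on_x(const float* A, const float* B,
                                float* C, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

Both kernels can be launched with the same configuration, for example `dim3 block(16, 16)` and `dim3 grid((N + 15) / 16, (N + 15) / 16)`, and they produce identical results. Only the memory access pattern within each warp changes.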

In theory, the second pair (kernel_shared1, kernel_shared2) should run at the same speed, because both use coalesced memory access (see “Programming Massively Parallel Processors”, chapter 5, page 111). But in my tests kernel_shared1 ran faster than kernel_shared2. Is this normal?
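
For context, a tiled shared-memory kernel in which both global loads are coalesced might look like the sketch below. This is an assumption-laden illustration and not the code of kernel_shared1 or kernel_shared2: the name `matmul_tiled`, the `TILE` size of 16, and the square row-major layout are hypothetical, and the kernel expects to be launched with a `TILE` x `TILE` thread block.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; launch with dim3 block(TILE, TILE)

// Hypothetical tiled kernel for illustration; not taken from mulmatrix.cu.
// Computes C = A * B for square, row-major N x N matrices.
__global__ void matmul_tiled(const float* A, const float* B,
                             float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + tx;  // column of A loaded by this thread
        int bRow = t * TILE + ty;  // row of B loaded by this thread

        // Both loads use tx as the fastest-varying global index, so
        // consecutive threads touch consecutive addresses (coalesced).
        // Out-of-range elements are padded with zero.
        As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // The strided walk over A's row happens in shared memory,
        // where it does not cost extra global memory transactions.
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Even when both global loads are coalesced like this, two variants can still differ in the index order they use to read the shared tiles and to write C, so coalesced loads alone do not guarantee identical run times.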

By the way, the book says “On a current generation device, the tiled kernel can run more than 30x faster than the simple kernel”, but I only get about a 3x speedup over the simple kernel. Is there anything I can improve in my kernel function (kernel_shared)?

The device I use is a TITAN Xp (NVIDIA-SMI 460.80, Driver Version: 460.80, CUDA Version: 11.2).

MatColgrove · June 8, 2021, 5:50pm

Hi xll,

I’d recommend posting CUDA C questions on the CUDA Forum: CUDA - NVIDIA Developer Forums

Thanks,
Mat

xll · June 9, 2021, 1:08am

ok, thanks
