Cuda matrix-multiply, memory coalescing (corner turning)

xll · June 9, 2021, 1:11am

When I use CUDA to speed up matrix multiplication with global memory and shared memory, I want to test the performance impact of memory coalescing. So I wrote two pairs of kernels to compare accessing values by row with accessing values by column.
mulmatrix.cu (7.4 KB)

For the first pair of kernel functions (kernel_globalx, kernel_globaly), the results seem to be in line with expectations: kernel_globalx is faster than kernel_globaly.
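To make the row-versus-column comparison concrete, here is a minimal sketch of two global-memory kernels that differ only in which thread index selects the column of C. These are not the kernels from mulmatrix.cu (the attachment is not reproduced here); the names matmul_col_from_x, matmul_row_from_x and max_error_vs_cpu are hypothetical, and row-major square matrices are assumed. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical variant 1: threadIdx.x selects the column of C.
// Neighbouring threads of a warp read consecutive elements of B and write
// consecutive elements of C, so those accesses can coalesce. A[row*n+k] is
// the same address for all of them.
__global__ void matmul_col_from_x(const float* A, const float* B, float* C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical variant 2: threadIdx.x selects the row of C.
// Neighbouring threads now read A and write C with a stride of n floats,
// which is the uncoalesced pattern.
__global__ void matmul_row_from_x(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical host helper: runs one variant on n x n inputs and returns the
// largest absolute difference from a CPU reference.
float max_error_vs_cpu(bool col_from_x, int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }
    size_t bytes = hA.size() * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    if (col_from_x) matmul_col_from_x<<<grid, block>>>(dA, dB, dC, n);
    else            matmul_row_from_x<<<grid, block>>>(dA, dB, dC, n);
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float worst = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            worst = std::fmax(worst, std::fabs(hC[r * n + c] - ref));
        }
    return worst;
}
```

Both variants compute the same product, so calling max_error_vs_cpu(true, 64) and max_error_vs_cpu(false, 64) from main only checks correctness; the difference between them is in how the warp's global-memory accesses line up.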

Theoretically, the second pair (kernel_shared1, kernel_shared2) should run at the same speed, because both use memory coalescing (see “Programming Massively Parallel Processors”, chapter 5, page 111). But in my test, kernel_shared1 ran faster than kernel_shared2. Is this normal?
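For reference, a tiled shared-memory kernel in the style the book describes can be sketched as follows. This is an illustration under assumptions, not kernel_shared1 or kernel_shared2 from the attachment: the tile size TILE = 16, the name matmul_tiled and the host helper tiled_max_error_vs_cpu are all hypothetical, and row-major square matrices are assumed. In this sketch both tile loads use threadIdx.x as the fastest-varying index into global memory.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; must match the block size

// Hypothetical tiled kernel: each block computes one TILE x TILE tile of C.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + tx;  // consecutive tx -> consecutive A addresses
        int b_row = t * TILE + ty;  // consecutive tx -> consecutive B addresses
        // Out-of-range elements are padded with zero so any n works.
        As[ty][tx] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[ty][tx] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];
        __syncthreads();  // everyone done reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs the tiled kernel and returns the largest
// absolute difference from a CPU reference. CUDA error checks are omitted.
float tiled_max_error_vs_cpu(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }
    size_t bytes = hA.size() * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float worst = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            worst = std::fmax(worst, std::fabs(hC[r * n + c] - ref));
        }
    return worst;
}
```

Calling tiled_max_error_vs_cpu(50) exercises the boundary handling, since 50 is not a multiple of the tile width. Two kernels that both coalesce their global loads can still differ in how they index the shared arrays, so a sketch like this only pins down the global-memory side of the comparison.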

By the way, the book says “On a current generation device, the tiled kernel can run more than 30x faster than the simple kernel”, but I only get about a 3x speedup over the simple kernel. Is there anything to improve in my kernel function (kernel_shared)?
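A speedup figure like this depends on how the kernels are timed. One common way, sketched here under assumptions, is to bracket repeated launches with CUDA events after a warm-up launch. The kernel matmul_simple and the helper average_kernel_ms are hypothetical stand-ins, not the code from the attachment, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical simple (untiled) kernel used only as something to time.
__global__ void matmul_simple(const float* A, const float* B, float* C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical helper: average time in milliseconds of one launch of
// matmul_simple on n x n matrices, measured over `repeats` launches.
float average_kernel_ms(int n, int repeats) {
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);  // contents do not matter for timing
    cudaMemset(dB, 0, bytes);

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);

    // Warm-up launch so one-time start-up costs are not measured.
    matmul_simple<<<grid, block>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        matmul_simple<<<grid, block>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches are done

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return total_ms / repeats;
}
```

Timing the simple and the tiled kernel the same way, on the same matrix size and with host-device copies kept outside the timed region, gives a ratio that can be compared with the book's figure.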

The device I use is a TITAN Xp (NVIDIA-SMI 460.80, Driver Version: 460.80, CUDA Version: 11.2).
