CUDA matrix multiply, memory coalescing (corner turning)

I am using CUDA to speed up matrix multiplication with global memory and shared memory, and I want to measure the performance impact of memory coalescing. I wrote two pairs of kernels to compare accessing values by row with accessing values by column.
mulmatrix.cu (7.4 KB)

For the first pair of kernels (kernel_globalx, kernel_globaly), the results are in line with expectations: kernel_globalx is faster than kernel_globaly.
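To illustrate what a row-versus-column comparison in global memory measures, here is a minimal sketch. It is not the code from mulmatrix.cu: the kernel names matmul_x_is_col and matmul_x_is_row, the helper global_max_error, the 16x16 block size, and the row-major square matrices are all assumptions made for the example. The only difference between the two kernels is which matrix index threadIdx.x is mapped to.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical variant 1: threadIdx.x selects the column of C.
// Adjacent threads in a warp read adjacent elements of B and write
// adjacent elements of C, so those accesses can be coalesced.
__global__ void matmul_x_is_col(const float* A, const float* B, float* C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical variant 2: threadIdx.x selects the row of C.
// Adjacent threads now touch addresses that are n floats apart in A and C,
// so the accesses are strided instead of coalesced.
__global__ void matmul_x_is_row(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical host helper: runs one variant on small test data and returns
// the maximum absolute difference from a CPU reference result.
float global_max_error(bool x_is_col, int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;
        hB[i] = (i % 5) * 0.25f;
    }
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);  // square grid works for both mappings
    if (x_is_col) matmul_x_is_col<<<grid, block>>>(dA, dB, dC, n);
    else          matmul_x_is_row<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float err = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            err = std::fmax(err, std::fabs(ref - hC[r * n + c]));
        }
    return err;
}
```

Calling global_max_error(true, n) and global_max_error(false, n) from main checks that both variants compute the same product; timing the two launches (for example with cudaEvent_t) then isolates the effect of the access pattern.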

In theory, the second pair (kernel_shared1, kernel_shared2) should run at the same speed, because both use memory coalescing (see “Programming Massively Parallel Processors”, chapter 5, page 111). However, my tests show that kernel_shared1 runs faster than kernel_shared2. Is this normal?
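For context, a tiled shared-memory kernel in the style the book describes can be sketched as follows. This is an illustration under stated assumptions, not kernel_shared1 or kernel_shared2 from the attachment: the names matmul_tiled and tiled_max_error, the tile width TILE = 16, and the row-major square matrices are hypothetical choices. Both tile loads use threadIdx.x as the fastest-varying global index, which is the corner-turning idea: the global loads are coalesced, and the column-wise walk over B happens in shared memory.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; the block must be TILE x TILE

// Hypothetical tiled kernel: each block computes one TILE x TILE tile of C.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + tx;  // adjacent tx -> adjacent addresses in A
        int b_row = t * TILE + ty;  // adjacent tx -> adjacent addresses in B
        // Out-of-range elements are padded with zeros so n need not be a
        // multiple of TILE.
        As[ty][tx] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[ty][tx] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];
        __syncthreads();  // wait before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs the tiled kernel on small test data and
// returns the maximum absolute difference from a CPU reference result.
float tiled_max_error(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;
        hB[i] = (i % 5) * 0.25f;
    }
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float err = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            err = std::fmax(err, std::fabs(ref - hC[r * n + c]));
        }
    return err;
}
```

A variant that swaps the roles of tx and ty in the tile loads or in the inner product would still be a valid tiled kernel, but its global and shared-memory access patterns would differ, which is the kind of difference a pair like this is meant to expose.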

By the way, the book says “On a current generation device, the tiled kernel can run more than 30x faster than the simple kernel”, but I only get about a 3x speedup over the simple kernel. Is there anything I can improve in my kernel function (kernel_shared)?

The device I use is a TITAN Xp (NVIDIA-SMI 460.80, Driver Version: 460.80, CUDA Version: 11.2).