Hi!
I’m quite new to CUDA and I got started with implementing a matrix multiplication kernel using shared memory, similiar to the MatMulKernel example in the CUDA programming guide (pp. 23).
Although I don’t see much differences between both kernels, my version is slower; especially when dimensions of the matrices grow.
The profiler shows a high warp serialize rate warning when running my kernel with larger data. This appears to be the problem, but unfortunately I don’t know how to fix this. As far as I know the threads of a warp have to be serialized when their execution paths are diverging, but their are no divergent branches in my kernel (divergent_branch field in profiler output is 0).
Can threads be serialized for other reasons as well?
Any hints would be appreciated.
This is the source of the kernel:
[codebox]/* PROFILER OUTPUT FOR DIM_X=BLOCK_SIZE*3 AND DIM_Y=BLOCK_SIZE*4 */method, gputime, cputime, occupancy, gridSizeX, gridSizeY, blockSizeX, blockSizeY, blockSizeZ, dynSmemPerBlock, staSmemPerBlock, registerPerThread, streamID, gld_incoherent, gld_coherent, gst_incoherent, gst_coherent, local_load, local_store, branch, divergent_branch, instructions, warp_serialize, cta_launched
Z12MatMulKernel6MatrixS_S, 7.84, 27.098, 0.667, 4, 4, 16, 16, 1, 0, 2112, 12, 0, 0, 192, 0, 128, 0, 0, 96, 0, 1719, 0, 2
Z11myMatrixMul6MatrixS_S, 21.728, 42.184, 0.667, 4, 4, 16, 16, 1, 0, 2112, 11, 0, 0, 192, 0, 128, 0, 0, 104, 0, 1615, 13671, 2
Occupancy analysis for kernel ‘myMatrixMul’ for device ‘BLOCK_SIZE3 x BLOCK_SIZE4 : Device0’ :
Kernel details : Grid size: 4 x 4, Block size: 16 x 16 x 1
Register Ratio = 0.6875 ( 5632 / 8192 ) [11 registers per thread]
Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]
Active Blocks per SM = 2 : 8
Active threads per SM = 512 : 768
Occupancy = 0.666667 ( 16 / 24 )
Occupancy limiting factor = Registers
Occupancy analysis for kernel ‘MatMulKernel’ for device ‘BLOCK_SIZE3 x BLOCK_SIZE4 : Device0’ :
Kernel details : Grid size: 4 x 4, Block size: 16 x 16 x 1
Register Ratio = 0.75 ( 6144 / 8192 ) [12 registers per thread]
Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]
Active Blocks per SM = 2 : 8
Active threads per SM = 512 : 768
Occupancy = 0.666667 ( 16 / 24 )
Occupancy limiting factor = Registers
Analyze profiler data for session ‘BLOCK_SIZE3 x BLOCK_SIZE4’ …
High warp serialize rate of 629.188 for method 'myMatrixMul
[/codebox]
Profiler output for an uncrtitical data configuration:
[codebox]/* PROFILER OUTPUT FOR DIM_X=BLOCK_SIZE2 AND DIM_Y=BLOCK_SIZE3 */
method, gputime, cputime, occupancy, gridSizeX, gridSizeY, blockSizeX, blockSizeY, blockSizeZ, dynSmemPerBlock, staSmemPerBlock, registerPerThread, streamID, gld_incoherent, gld_coherent, gst_incoherent, gst_coherent, local_load, local_store, branch, divergent_branch, instructions, warp_serialize, cta_launched
Z12MatMulKernel6MatrixS_S, 5.632, 24.863, 0.667, 3, 3, 16, 16, 1, 0, 2112, 12, 0, 0, 64, 0, 64, 0, 0, 72, 0, 1327, 0, 1
Z11myMatrixMul6MatrixS_S, 14.944, 34.641, 0.667, 3, 3, 16, 16, 1, 0, 2112, 11, 0, 0, 64, 0, 64, 0, 0, 0, 0, 0, 0, 1
Occupancy analysis for kernel ‘myMatrixMul’ for device ‘BLOCK_SIZE2 x BLOCK_SIZE3 : Device0’ :
Kernel details : Grid size: 3 x 3, Block size: 16 x 16 x 1
Register Ratio = 0.6875 ( 5632 / 8192 ) [11 registers per thread]
Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]
Active Blocks per SM = 2 : 8
Active threads per SM = 512 : 768
Occupancy = 0.666667 ( 16 / 24 )
Occupancy limiting factor = Registers
Warning: Grid Size (9) is less than number of available SMs (16).
Occupancy analysis for kernel ‘MatMulKernel’ for device ‘BLOCK_SIZE2 x BLOCK_SIZE3 : Device0’ :
Kernel details : Grid size: 3 x 3, Block size: 16 x 16 x 1
Register Ratio = 0.75 ( 6144 / 8192 ) [12 registers per thread]
Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]
Active Blocks per SM = 2 : 8
Active threads per SM = 512 : 768
Occupancy = 0.666667 ( 16 / 24 )
Occupancy limiting factor = Registers
Warning: Grid Size (9) is less than number of available SMs (16).
Analyze profiler data for session ‘BLOCK_SIZE2 x BLOCK_SIZE3’ …
(nothing)
[/codebox]