Having problems with warp divergence/serialization: profiler shows a high warp serialize rate although divergent_branch is 0

Hi!

I’m quite new to CUDA and I started by implementing a matrix multiplication kernel using shared memory, similar to the MatMulKernel example in the CUDA programming guide (p. 23).

Although I don’t see many differences between the two kernels, my version is slower, especially as the matrix dimensions grow.

The profiler shows a high warp serialize rate warning when running my kernel on larger data. This appears to be the problem, but unfortunately I don’t know how to fix it. As far as I know, the threads of a warp are serialized when their execution paths diverge, but there are no divergent branches in my kernel (the divergent_branch field in the profiler output is 0).

Can threads be serialized for other reasons as well?

Any hints would be appreciated.

This is the source of the kernel:
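(A sketch of it is below; the Matrix struct and the posA/posB index arithmetic are reconstructed and may differ in detail from my actual code, but the shared-memory access pattern is the relevant part.)

[codebox]#define BLOCK_SIZE 16

// Row-major matrix, as in the programming guide example.
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

__global__ void myMatrixMul(Matrix A, Matrix B, Matrix C)
{
    __shared__ float a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float b[BLOCK_SIZE][BLOCK_SIZE];

    float cValue = 0.0f;

    // position of this thread's element in the first sub-matrix pair
    int2 posA = make_int2(threadIdx.x, blockIdx.y * BLOCK_SIZE + threadIdx.y);
    int2 posB = make_int2(blockIdx.x * BLOCK_SIZE + threadIdx.x, threadIdx.y);

    // assumes matrix dimensions are multiples of BLOCK_SIZE
    for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
        // copy the sub-matrices to shared memory, transposed ([x][y])
        a[threadIdx.x][threadIdx.y] = A.elements[posA.y * A.width + posA.x];
        b[threadIdx.x][threadIdx.y] = B.elements[posB.y * B.width + posB.x];
        __syncthreads();

        // iterate over a row in the copied sub-matrix a (a column in sub-matrix b)
        for (int e = 0; e < BLOCK_SIZE; ++e)
            cValue += a[e][threadIdx.y] * b[threadIdx.x][e];
        __syncthreads();

        posA.x += BLOCK_SIZE;
        posB.y += BLOCK_SIZE;
    }

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    C.elements[row * C.width + col] = cValue;
}
[/codebox]

And the profiler output: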

[codebox]/* PROFILER OUTPUT FOR DIM_X=BLOCK_SIZE*3 AND DIM_Y=BLOCK_SIZE*4 */

method, gputime, cputime, occupancy, gridSizeX, gridSizeY, blockSizeX, blockSizeY, blockSizeZ, dynSmemPerBlock, staSmemPerBlock, registerPerThread, streamID, gld_incoherent, gld_coherent, gst_incoherent, gst_coherent, local_load, local_store, branch, divergent_branch, instructions, warp_serialize, cta_launched

_Z12MatMulKernel6MatrixS_S_, 7.84, 27.098, 0.667, 4, 4, 16, 16, 1, 0, 2112, 12, 0, 0, 192, 0, 128, 0, 0, 96, 0, 1719, 0, 2

_Z11myMatrixMul6MatrixS_S_, 21.728, 42.184, 0.667, 4, 4, 16, 16, 1, 0, 2112, 11, 0, 0, 192, 0, 128, 0, 0, 104, 0, 1615, 13671, 2

Occupancy analysis for kernel ‘myMatrixMul’ for device ‘BLOCK_SIZE*3 x BLOCK_SIZE*4 : Device0’ :

Kernel details : Grid size: 4 x 4, Block size: 16 x 16 x 1

Register Ratio = 0.6875 ( 5632 / 8192 ) [11 registers per thread]

Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]

Active Blocks per SM = 2 : 8

Active threads per SM = 512 : 768

Occupancy = 0.666667 ( 16 / 24 )

Occupancy limiting factor = Registers

Occupancy analysis for kernel ‘MatMulKernel’ for device ‘BLOCK_SIZE*3 x BLOCK_SIZE*4 : Device0’ :

Kernel details : Grid size: 4 x 4, Block size: 16 x 16 x 1

Register Ratio = 0.75 ( 6144 / 8192 ) [12 registers per thread]

Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]

Active Blocks per SM = 2 : 8

Active threads per SM = 512 : 768

Occupancy = 0.666667 ( 16 / 24 )

Occupancy limiting factor = Registers

Analyze profiler data for session ‘BLOCK_SIZE*3 x BLOCK_SIZE*4’ …

High warp serialize rate of 629.188 for method ‘myMatrixMul’

[/codebox]

Profiler output for a non-critical data configuration:

[codebox]/* PROFILER OUTPUT FOR DIM_X=BLOCK_SIZE*2 AND DIM_Y=BLOCK_SIZE*3 */

method, gputime, cputime, occupancy, gridSizeX, gridSizeY, blockSizeX, blockSizeY, blockSizeZ, dynSmemPerBlock, staSmemPerBlock, registerPerThread, streamID, gld_incoherent, gld_coherent, gst_incoherent, gst_coherent, local_load, local_store, branch, divergent_branch, instructions, warp_serialize, cta_launched

_Z12MatMulKernel6MatrixS_S_, 5.632, 24.863, 0.667, 3, 3, 16, 16, 1, 0, 2112, 12, 0, 0, 64, 0, 64, 0, 0, 72, 0, 1327, 0, 1

_Z11myMatrixMul6MatrixS_S_, 14.944, 34.641, 0.667, 3, 3, 16, 16, 1, 0, 2112, 11, 0, 0, 64, 0, 64, 0, 0, 0, 0, 0, 0, 1

Occupancy analysis for kernel ‘myMatrixMul’ for device ‘BLOCK_SIZE*2 x BLOCK_SIZE*3 : Device0’ :

Kernel details : Grid size: 3 x 3, Block size: 16 x 16 x 1

Register Ratio = 0.6875 ( 5632 / 8192 ) [11 registers per thread]

Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]

Active Blocks per SM = 2 : 8

Active threads per SM = 512 : 768

Occupancy = 0.666667 ( 16 / 24 )

Occupancy limiting factor = Registers

Warning: Grid Size (9) is less than number of available SMs (16).

Occupancy analysis for kernel ‘MatMulKernel’ for device ‘BLOCK_SIZE*2 x BLOCK_SIZE*3 : Device0’ :

Kernel details : Grid size: 3 x 3, Block size: 16 x 16 x 1

Register Ratio = 0.75 ( 6144 / 8192 ) [12 registers per thread]

Shared Memory Ratio = 0.3125 ( 5120 / 16384 ) [2112 bytes per Block]

Active Blocks per SM = 2 : 8

Active threads per SM = 512 : 768

Occupancy = 0.666667 ( 16 / 24 )

Occupancy limiting factor = Registers

Warning: Grid Size (9) is less than number of available SMs (16).

Analyze profiler data for session ‘BLOCK_SIZE*2 x BLOCK_SIZE*3’ …

(nothing)

[/codebox]

warp serialize == bank conflict

Serialization is from bank conflicts, specifically these:

a[threadIdx.x][threadIdx.y]

b[threadIdx.x][threadIdx.y]

b[threadIdx.x][e]

Sequential threads within the same warp are accessing locations that are exactly BLOCK_SIZE apart, which will cause bank conflicts if BLOCK_SIZE is a multiple of 16: with 16 banks of 32-bit words, the word at index w lives in bank w mod 16, so a stride of 16 words puts every thread of a half-warp into the same bank.

Try declaring your shared memory slightly differently (similar to in the matrix transpose example in the SDK):

[codebox]__shared__ float a[BLOCK_SIZE][BLOCK_SIZE+1];
__shared__ float b[BLOCK_SIZE][BLOCK_SIZE+1];
[/codebox]

This causes sequential threads to access locations that are BLOCK_SIZE + 1 apart, and if BLOCK_SIZE is a multiple of 16, then all the threads in a half warp will access different banks.
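If you want to convince yourself of the bank arithmetic, here is a minimal host-side sketch (assuming the 16 banks of 32-bit words of compute capability 1.x hardware, where the word at index w lives in bank w mod 16):

[codebox]#include <stdio.h>

/* Prints the shared memory bank hit by each thread tx of a half-warp
 * for a column access a[tx][ty], with and without the +1 padding.
 * Assumes 16 banks of 4-byte words (compute capability 1.x). */
int main(void)
{
    const int BANKS = 16, BLOCK_SIZE = 16, ty = 0;
    printf("tx : bank (stride %d) | bank (stride %d)\n", BLOCK_SIZE, BLOCK_SIZE + 1);
    for (int tx = 0; tx < 16; ++tx) {
        int unpadded = (tx * BLOCK_SIZE + ty) % BANKS;        /* a[BLOCK_SIZE][BLOCK_SIZE]   */
        int padded   = (tx * (BLOCK_SIZE + 1) + ty) % BANKS;  /* a[BLOCK_SIZE][BLOCK_SIZE+1] */
        printf("%2d :       %2d        |       %2d\n", tx, unpadded, padded);
    }
    return 0;
}
[/codebox]

Without the padding every thread lands in bank ty (a 16-way conflict); with it, thread tx lands in bank (17 * tx) mod 16 = tx, so the 16 threads of a half-warp hit 16 different banks.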

Another option would be to completely transpose your a and b matrices:

[codebox]a[threadIdx.y][threadIdx.x] = A.elements[posA.y * A.width + posA.x];
b[threadIdx.y][threadIdx.x] = B.elements[posB.y * B.width + posB.x];
__syncthreads();

// iterate over a row in the copied sub-matrix a (a column in sub-matrix b)
for (int e = 0; e < BLOCK_SIZE; ++e)
    cValue += a[threadIdx.y][e] * b[e][threadIdx.x];
[/codebox]
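In the full tile loop this sits roughly as follows (a sketch that assumes the posA/posB bookkeeping from the kernel above; the second __syncthreads() keeps a thread from refilling a tile that others are still reading):

[codebox]for (int m = 0; m < A.width / BLOCK_SIZE; ++m) {
    // tiles are now stored row-major, matching the global layout
    a[threadIdx.y][threadIdx.x] = A.elements[posA.y * A.width + posA.x];
    b[threadIdx.y][threadIdx.x] = B.elements[posB.y * B.width + posB.x];
    __syncthreads();

    // a[threadIdx.y][e] is a broadcast within the half-warp, and
    // b[e][threadIdx.x] touches consecutive words: no bank conflicts
    for (int e = 0; e < BLOCK_SIZE; ++e)
        cValue += a[threadIdx.y][e] * b[e][threadIdx.x];

    __syncthreads();

    posA.x += BLOCK_SIZE;  // next sub-matrix of A to the right
    posB.y += BLOCK_SIZE;  // next sub-matrix of B below
}
[/codebox]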

Thanks for your replies!
Transposing the sub-matrices as described worked: the profiler no longer reports warp serialization problems when executing the kernel.

Hi,

I found this thread and I am having the same problem. I tried the original “problem” code from above and got a 91% warp serialize rate. OK, that’s clear. Then I tried the suggested version and got a warp serialize rate of about 47.4%.

This is exactly the warp serialize rate I get for my own implementation.

Why is the warp serialization still that high?

Best regards,

gemini