What happens if block_size is not a multiple of 32?

I have read and heard that the number of threads assigned to a block should always be a multiple of the warp size (32), otherwise performance drops. But it is not clear to me why performance drops.

In my case, every thread processes one element. The number of elements in one task is K, and K is never larger than 1024, so K can be any value from 1 to 1024.
Questions:

  • Can I set block_size to K?
  • Is that better than setting block_size to ((K + 31) >> 5) << 5, i.e. K rounded up to the next multiple of 32?
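
For reference, here is a small host-side C++ sketch of the rounding expression from the question. The helper names `round_up_to_warp` and `warps_in_block` are made up for illustration, and the warp size is assumed to be 32. It also shows that the hardware schedules whole warps, so a block of K threads and a block of K rounded up to a multiple of 32 occupy the same number of warps.

```cpp
#include <cassert>

// Assumed warp size (32 on current NVIDIA GPUs).
const int kWarpSize = 32;

// Hypothetical helper: round k up to the next multiple of 32.
// For positive k this is the same value as ((k + 31) >> 5) << 5.
int round_up_to_warp(int k) {
    assert(k > 0);
    return ((k + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Hypothetical helper: number of warps a block of k threads occupies.
// Warps are allocated whole, so a partial warp still counts as one.
int warps_in_block(int k) {
    assert(k > 0);
    return (k + kWarpSize - 1) / kWarpSize;
}
```

For K = 100, `round_up_to_warp(100)` gives 128, and both `warps_in_block(100)` and `warps_in_block(128)` give 4.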

Set block size to K. This saves you from having to implement a range check in the code.

The last warp in the block will have some non-participating threads (i.e. part of that warp has unused capacity). This is why it’s said to lower performance. If the block also contains several complete warps, the performance drop is small. I would not worry about performance losses if K is larger than 128 most of the time. Do you have an idea what the mean K value is?

@cbuchner1 Thanks for your reply. K can be an arbitrary number between 1 and 1024, and you can assume that the distribution is uniform.
What if K is less than 128 most of the time? What if K is greater than 128 most of the time?

96 <= K <= 128: percentage lost due to idle CUDA cores may be between 0 and 25%

32 <= K <= 64: percentage lost due to idle CUDA cores may be between 0 and 50%. Also it may be hard to get enough occupancy on the device with such small blocks.

I figure that K >= 128 is enough threads that non-participating CUDA cores are no longer a significant slowdown (always less than 20% performance loss).
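
The percentages above can be reproduced with a short host-side C++ sketch. The function name `idle_lane_fraction` is hypothetical, and the warp size is assumed to be 32. It only counts idle lanes in the last warp, so treat it as a rough upper bound on lost throughput rather than a measured slowdown.

```cpp
#include <cassert>

// Hypothetical helper: fraction of allocated warp lanes that sit idle
// when a block needs k threads and warps are allocated in units of 32.
double idle_lane_fraction(int k) {
    assert(k > 0);
    const int warp_size = 32;  // assumed warp size
    // Total lanes in all warps the block occupies (k rounded up to 32).
    const int lanes = ((k + warp_size - 1) / warp_size) * warp_size;
    return static_cast<double>(lanes - k) / lanes;
}
```

The worst case in each range is the K just above a multiple of 32: K = 33 leaves 31 of 64 lanes idle, K = 97 leaves 31 of 128 idle, and K = 129 leaves 31 of 160 idle, which is where the "less than 20%" figure comes from.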

@cbuchner1 Thank you very much for your explanation. Let’s take a specific case. If K is equal to 100, the block_size can be set in two ways:

  • one is to set block_size to 100: the first 3 warps are full and the last warp has only 4 threads;
  • the other is to set block_size to 128: the first 3 warps are full and the last warp has 32 threads, but only 4 of them do any work.

So which of these two situations is more efficient? The number of threads that actually do work is the same in both, namely 100.

The version with block dimension 100 is slightly more efficient, as a block size of 128 would require you to add extra instructions to compare threadIdx.x with K (a range or bounds check).
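
To make the extra bounds check concrete, here is a host-side C++ emulation of the two variants. It is not CUDA code: the loop variable `tid` stands in for threadIdx.x, the function names are made up for illustration, and doubling each element is just a placeholder for the real per-element work.

```cpp
#include <cassert>
#include <vector>

// Emulates a block launched with exactly k threads: every thread has a
// valid element, so no guard is needed. Returns the number of bounds
// checks executed (always 0 here).
int run_block_exact(std::vector<float>& data, int k) {
    assert(k > 0 && static_cast<int>(data.size()) >= k);
    for (int tid = 0; tid < k; ++tid) {  // one iteration per emulated thread
        data[tid] *= 2.0f;               // placeholder per-element work
    }
    return 0;
}

// Emulates a block padded to a multiple of 32: every thread first has to
// compare tid with k. Returns the number of bounds checks executed.
int run_block_padded(std::vector<float>& data, int k) {
    assert(k > 0 && static_cast<int>(data.size()) >= k);
    const int block_size = ((k + 31) / 32) * 32;
    int checks = 0;
    for (int tid = 0; tid < block_size; ++tid) {
        ++checks;                        // the range check each thread runs
        if (tid < k) {
            data[tid] *= 2.0f;           // same work as the exact version
        }
    }
    return checks;
}
```

With K = 100 both variants update the same 100 elements, but the padded one runs the comparison in all 128 emulated threads, 28 of which then do nothing.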

I see, thank you!!!