# Elementwise kernel: number of thread blocks

The CUDA programming guide says that, to avoid the tail effect, the number of thread blocks should be equal to, or a little less than, a multiple of the number of SMs (e.g., 114 for the Hopper architecture). I tested a 65536 * 54 element element-wise add kernel like the one below. Because the maximum number of resident warps per SM is 64, I think that when blockDim is 256, more than 114 * 8 thread blocks are unnecessary. However, the execution time is much slower than with 13864 thread blocks. Am I missing something?

I didn't see any kernel.

That statement isn't very clear to me. What exactly are you comparing? What are your measured execution times for each case?

However, the tail-effect statement seems sensible to me. 65536 x 54 is around 3.5M elements. A Hopper SM can hold up to 2048 resident threads, and 2048 threads times the number of SMs is much less than ~3.5M, so we would either want to arrange for multiple carefully sized waves at one element per thread, or else use a grid-stride loop with presumably one wave (but multiple loop iterations in that wave).

Assuming 114 SMs (the number varies based on PCIE vs. SXM hopper variant), and assuming such a simple kernel can achieve 100% theoretical occupancy (i.e. no other limiters to occupancy), the wave calculation looks like this:

``````65536x54 = 3,538,944
114x2048 = 233,472
3,538,944 / 233,472 = 15.157 waves
``````

The waves calculation above can be thought of as either the number of waves for a kernel design that is one element per thread, or else an "average per-thread loop count" for the grid-stride loop case.

When I do the calculation for a single-element-per-thread kernel design using 256-thread blocks, I get:

``````3538944 / 256 = 13,824
``````

not the 13864 number you mentioned.

The wave effect is usually something to consider when the number of waves would be small, say less than 8. With 15 or 16 waves, the optimization level you can go after here is on the order of 1/15 or 1/16, i.e. around 6 or 7%. So if you are concerned about 6%, then this may be something to consider. But if you need more than that, you will need to look elsewhere.

Furthermore, in my experience, when you get out to ~15 waves (really we could multiply this to a higher number based on 8 threadblocks per wave per SM), the GPU execution pattern gets pretty chaotic, so being able to actually measure a sensible ~6% variation in kernel performance may not be feasible.


Thank you for answering! Sorry, I forgot to attach the kernel code.

``````__global__ void add_b114(uint32_t *dst, uint32_t *src1, uint32_t *src2){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for(int i = tid; i < N; i += blockDim.x * gridDim.x){
        dst[i] = src1[i] + src2[i];
    }
}

__global__ void add_t256(uint32_t *dst, uint32_t *src1, uint32_t *src2){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src1[tid] + src2[tid];
}
``````

add_b114 is launched with 912 (114 * 8) thread blocks and add_t256 with 13824 thread blocks. I think both kernels should show the same performance because the number of thread blocks resident on the GPU at one time is the same. However, add_b114 is much slower (over 100 times slower).

I note that the add_t256 kernel has no thread-check, i.e. an if statement making sure the computed `tid` value is in-bounds. It's remotely possible that that is playing a role. It's generally not a robust design practice anyway.
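For reference, a bounds-checked variant of add_t256 might look like this (a sketch; `N` is assumed to be the same compile-time element count used in the original kernels):

``````__global__ void add_t256_checked(uint32_t *dst, uint32_t *src1, uint32_t *src2){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {  // guard against threads past the end of the arrays
        dst[tid] = src1[tid] + src2[tid];
    }
}
``````

With this guard, the kernel stays correct even when the grid size is rounded up past the number of elements.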