The relationship between execution time and block size

I wrote a simple vector-adder program. When I change the block size from 1024 to 512 to 256 to 128, the execution time stays about the same. Why? In my view, a smaller block size means more blocks, and those blocks can work on different cores in parallel, so the execution time should go down. For example, with blocksize=512 there should be 2 blocks working on 2 cores together, and with blocksize=256 there should be 4 blocks working on 4 cores together. So shouldn't the execution time be cut in half? But in my experiment the execution time is similar. Why?
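
For reference, a minimal vector-add kernel of the kind described would look something like this (just a sketch; the name vecAdd and its arguments are illustrative, not taken from the original program):

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // guard against the partial last block
        c[i] = a[i] + b[i];
}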

Block size is the number of threads per block, so changing it doesn’t change anything as long as the grid size (i.e., the total number of threads to execute) stays the same.

Moreover, since the grid size is usually much larger than the number of GPU cores, changing it usually has no effect either: the GPU simply starts new threads on the same cores as the first wave of threads finishes.
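
To make that concrete, here is a sketch (assuming the vecAdd kernel sketched above): halving the block size doubles the block count, but the total number of threads, and hence the total work, stays the same:

int N = 1 << 20;                                     // total elements, fixed
float *a, *b, *c;                                    // device buffers
cudaMalloc(&a, N * sizeof(float));
cudaMalloc(&b, N * sizeof(float));
cudaMalloc(&c, N * sizeof(float));
for (int blockSize : {1024, 512, 256, 128}) {
    int gridSize = (N + blockSize - 1) / blockSize;  // smaller blocks -> more blocks
    vecAdd<<<gridSize, blockSize>>>(a, b, c, N);     // same N elements processed each time
    cudaDeviceSynchronize();
}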

Vector addition is a purely memory-bound code, and the variations in occupancy caused by different block sizes lead to performance differences that are minimal but measurable (single-digit percentages). But for many more complicated real-life use cases, occupancy and other resource-utilization issues related to thread block size can make a sizeable difference in performance, in the tens of percent.
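
You can check the memory-bound claim with back-of-the-envelope arithmetic (a sketch; N and the measured kernel time ms are assumed to come from your own run). Vector addition reads two floats and writes one per element, so:

double bytesMoved = 3.0 * N * sizeof(float);     // 2 reads + 1 write per element
double effectiveGBs = bytesMoved / (ms * 1.0e6); // ms in milliseconds -> GB/s
// if effectiveGBs is near the GPU's peak memory bandwidth, the kernel is
// memory-bound and the block size can barely move the needle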

It is possible that this effect has become less pronounced with more recent GPU architectures with more abundant resources; I don’t have enough data points to either support or refute that hypothesis.

As a rule of thumb, one should strive to have at least two active blocks on every SM, and block sizes between 128 and 256 (in steps of 32) are usually a good starting point for designing the work distribution in a CUDA kernel.
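
If you would rather not guess, the CUDA runtime can suggest a block size for a given kernel; here is a sketch using the occupancy API (applied to the vecAdd kernel assumed above):

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecAdd, 0, 0);
// blockSize now holds a size that maximizes occupancy for this kernel;
// minGridSize is the smallest grid that can still reach that occupancy
int gridSize = (N + blockSize - 1) / blockSize;
vecAdd<<<gridSize, blockSize>>>(a, b, c, N);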

Thank you for your help, but I am still confused. For example, suppose the global size is fixed at 1024. Case 1: blocksize=512, which means we have 2 blocks, and they will work on 2 different SMs concurrently. Case 2: blocksize=256, so we have 4 blocks, and they will work on 4 different SMs. Shouldn't case 2, using 4 SMs, have half the execution time compared to case 1? I know my reasoning has a problem, but could you explain to me where it goes wrong? Thank you very much.

You’re assuming that every block takes the same amount of computation no matter what the launch configuration is. That can certainly be true depending on how you code your problem, and in that case more blocks will just do more work and take longer.
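
For example, a (hypothetical) kernel like this one does a fixed amount of work per thread regardless of the launch configuration, so launching more blocks adds work instead of finishing sooner:

__global__ void perBlockWork(float *out) {
    // each thread does the same fixed computation; 4 blocks therefore do
    // twice the total work of 2 blocks, so more blocks do not mean faster
    float acc = 0.0f;
    for (int i = 0; i < 1000; ++i)
        acc += sinf(i * 0.001f);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}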

But most often you write code that adapts to blocksize and especially gridsize. A common idiom is something like:

for (int index = threadIdx.x + blockIdx.x * blockDim.x; index <= MAX_N; index += gridDim.x * blockDim.x) {
    // ... do work with the item identified by [index], from 0 to MAX_N, here
}

This idiom adapts the work per thread to both the block size and the grid size, allowing you to tune those for best performance. Higher grid sizes may help occupy more SMs, at a small efficiency loss, so there’s a sweet-spot trade-off (usually found empirically, and usually not very sensitive). Block sizes are more often tuned, since they are frequently chosen to allow multiple blocks per SM (fewer threads mean fewer registers and less shared memory per block, and when blocks are small enough, an SM can work on several of them in parallel). But sometimes you want larger blocks for code efficiency, especially if there’s a lot of inter-thread communication through shared memory. So that value is tunable as well for best overall performance.
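
Put together as a self-contained kernel, the idiom looks like this (a sketch; the name scaleAdd and the operation are illustrative):

__global__ void scaleAdd(const float *x, float *y, int n) {
    // grid-stride loop: work per thread adapts to grid size and block size
    for (int index = threadIdx.x + blockIdx.x * blockDim.x; index < n;
         index += gridDim.x * blockDim.x)
        y[index] += 2.0f * x[index];
}

Grid and block size are now free tuning parameters, independent of n, e.g. scaleAdd<<<numBlocks, blockSize>>>(x, y, n).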

Thank you for your help, but my job is not to get the best performance. I want to run the same program on one SM and on two SMs. Could you give me some ideas on how to do that?

What I mean is that the grid size is fixed, for example at 1024. Case 1: blocksize=512, which means we have 2 blocks, and they will work on 2 different SMs concurrently. Case 2: blocksize=256, so we have 4 blocks, and they will work on 4 different SMs. So should case 2 have the lower execution time?

If I understand your description correctly, what you describe are not reasonable use cases for the GPU.

As a general recommendation, a grid should comprise at least 20 times the number of thread blocks that can run concurrently across the GPU’s SMs. You may be able to achieve good efficiency with a somewhat smaller grid for compute-intensive kernels.
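
That rule of thumb can be sketched in code (blocksPerSM is an assumed value here, not something queried from the hardware):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
int blocksPerSM = 2;                 // assumption: 2 resident blocks per SM
int gridSize = 20 * blocksPerSM * prop.multiProcessorCount;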

GPUs are designed for massively parallel execution, and you need on the order of 10,000 threads at minimum to put all their resources to good use.
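
If you do want to compare your two cases anyway, at least time them with CUDA events rather than host timers, so you measure only the kernel (a sketch, reusing the assumed vecAdd and launch parameters from above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vecAdd<<<gridSize, blockSize>>>(a, b, c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds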