The relationship between execution time and block size

I wrote a simple vector-adder program. When I change the block size from 1024 to 512 to 256 to 128, the execution time stays about the same. Why? In my view, a smaller block size means more blocks, and those blocks can work on different cores in parallel, so the execution time should go down. For example, with blocksize=512 there should be 2 blocks working on 2 cores together, and with blocksize=256 there should be 4 blocks working on 4 cores together. So shouldn't the execution time be cut in half? But in my experiment the execution time is similar. Why?
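
For reference, a minimal vector-add kernel of the kind described would look something like this (just a sketch; the name vecAdd and its arguments are illustrative, not taken from the original program):

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // guard against the partial last block
        c[i] = a[i] + b[i];
}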

Block size is the number of threads per block, so changing it doesn’t change anything as long as the grid size (i.e., the total number of threads to execute) stays the same.

Moreover, since the grid size is usually much larger than the number of GPU cores, changing it usually has no effect either: the GPU simply starts new threads on the same cores as the first wave of threads finishes.
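
To make that concrete, here is a sketch (assuming the vecAdd kernel sketched above): halving the block size doubles the block count, but the total number of threads, and hence the total work, stays the same:

int N = 1 << 20;                                     // total elements, fixed
float *a, *b, *c;                                    // device buffers
cudaMalloc(&a, N * sizeof(float));
cudaMalloc(&b, N * sizeof(float));
cudaMalloc(&c, N * sizeof(float));
for (int blockSize : {1024, 512, 256, 128}) {
    int gridSize = (N + blockSize - 1) / blockSize;  // smaller blocks -> more blocks
    vecAdd<<<gridSize, blockSize>>>(a, b, c, N);     // same N elements processed each time
    cudaDeviceSynchronize();
}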

Vector addition is a purely memory-bound code, and the variations in occupancy caused by different block sizes lead to performance differences that are minimal but measurable (single-digit percentages). But for many more complicated real-life use cases, occupancy and other resource-utilization issues related to thread block size can make a sizeable difference in performance, in the tens of percent.
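
You can check the memory-bound claim with back-of-the-envelope arithmetic (a sketch; N and the measured kernel time ms are assumed to come from your own run). Vector addition reads two floats and writes one per element, so:

double bytesMoved = 3.0 * N * sizeof(float);     // 2 reads + 1 write per element
double effectiveGBs = bytesMoved / (ms * 1.0e6); // ms in milliseconds -> GB/s
// if effectiveGBs is near the GPU's peak memory bandwidth, the kernel is
// memory-bound and the block size can barely move the needle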

It is possible that this effect has become less pronounced with more recent GPU architectures with more abundant resources; I don’t have enough data points to either support or refute that hypothesis.

As a rule of thumb, one should strive to have at least two active blocks on every SM, and block sizes between 128 and 256 (in steps of 32) are usually a good starting point for designing the work distribution in a CUDA kernel.
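
If you would rather not guess, the CUDA runtime can suggest a block size for a given kernel; here is a sketch using the occupancy API (applied to the vecAdd kernel assumed above):

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecAdd, 0, 0);
// blockSize now holds a size that maximizes occupancy for this kernel;
// minGridSize is the smallest grid that can still reach that occupancy
int gridSize = (N + blockSize - 1) / blockSize;
vecAdd<<<gridSize, blockSize>>>(a, b, c, N);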

Thank you for your help, but I am still confused. For example, suppose the global size is fixed at 1024. Case 1: blocksize=512, which means we have 2 blocks, and they will work on 2 different SMs concurrently. Case 2: blocksize=256, so we have 4 blocks, and they will work on 4 different SMs. Shouldn't case 2, using 4 SMs, have half the execution time compared to case 1? I know my reasoning has a problem, but could you explain to me where it goes wrong? Thank you very much.

You’re assuming that every block takes the same amount of computation no matter what the launch configuration is. That can certainly be true depending on how you code your problem, and in that case more blocks will just do more work and take longer.
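
For example, a (hypothetical) kernel like this one does a fixed amount of work per thread regardless of the launch configuration, so launching more blocks adds work instead of finishing sooner:

__global__ void perBlockWork(float *out) {
    // each thread does the same fixed computation; 4 blocks therefore do
    // twice the total work of 2 blocks, so more blocks do not mean faster
    float acc = 0.0f;
    for (int i = 0; i < 1000; ++i)
        acc += sinf(i * 0.001f);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}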

But most often you write code that adapts to blocksize and especially gridsize. A common idiom is something like:

for (int index = threadIdx.x + blockIdx.x * blockDim.x; index <= MAX_N; index += gridDim.x * blockDim.x) {
    // ... do work with the item identified by [index], from 0 to MAX_N, here
}

This idiom adapts the work per thread to both the block size and the grid size, allowing you to tune those for best performance. Higher grid sizes may help occupy more SMs, at a small efficiency loss, so there’s a sweet-spot trade-off (usually found empirically, and usually not very sensitive). Block sizes are more often tuned, since they are frequently chosen to allow multiple blocks per SM (fewer threads mean fewer registers and less shared memory per block, and when blocks are small enough, an SM can work on several of them in parallel). But sometimes you want larger blocks for code efficiency, especially if there’s a lot of inter-thread communication through shared memory. So that value is tunable as well for best overall performance.
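
Put together as a self-contained kernel, the idiom looks like this (a sketch; the name scaleAdd and the operation are illustrative):

__global__ void scaleAdd(const float *x, float *y, int n) {
    // grid-stride loop: work per thread adapts to grid size and block size
    for (int index = threadIdx.x + blockIdx.x * blockDim.x; index < n;
         index += gridDim.x * blockDim.x)
        y[index] += 2.0f * x[index];
}

Grid and block size are now free tuning parameters, independent of n, e.g. scaleAdd<<<numBlocks, blockSize>>>(x, y, n).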

Thank you for your help, but my job is not to get the best performance. I want to run the same program on one SM and on two SMs. Could you give me some ideas on how to do that?

What I mean is that the grid size is fixed, for example at 1024. Case 1: blocksize=512, which means we have 2 blocks, and they will work on 2 different SMs concurrently. Case 2: blocksize=256, so we have 4 blocks, and they will work on 4 different SMs. So should case 2 have the lower execution time?

If I understand your description correctly, what you describe are not reasonable use cases for the GPU.

As a general recommendation, a grid should comprise at least 20 times the number of thread blocks that can run concurrently across the GPU’s SMs. You may be able to achieve good efficiency with a somewhat smaller grid for compute-intensive kernels.
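
That rule of thumb can be sketched in code (blocksPerSM is an assumed value here, not something queried from the hardware):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
int blocksPerSM = 2;                 // assumption: 2 resident blocks per SM
int gridSize = 20 * blocksPerSM * prop.multiProcessorCount;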

GPUs are designed for massively parallel execution, and you need on the order of 10,000 threads at minimum to put all their resources to good use.
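
If you do want to compare your two cases anyway, at least time them with CUDA events rather than host timers, so you measure only the kernel (a sketch, reusing the assumed vecAdd and launch parameters from above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vecAdd<<<gridSize, blockSize>>>(a, b, c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds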