Reducing threads per block increases the number of blocks?


I changed the default number of threads per block in the CUDA sample ‘MonteCarlo’ and then used the Visual Profiler to see what happens. Reducing the default 256 threads to 128 increases ‘Active Blocks’ from 8 to 16, and reducing further from 128 to 64 increases ‘Active Blocks’ to 32, which is the device limit.

Why does this happen, and why does the ‘Active Threads’ number stay steady despite these changes (Active Threads: Theoretical -> 2048, Device Limit -> 2048)?

When I reduced the number of threads to 1, the ‘Block Limit’ became 64, exceeding the ‘Device Limit’ of 32, and this time ‘Active Threads’ fell to 1024.

Can someone help me understand why this happens? What if I want to reduce the overall number of threads used and impose some limits on my application? For example, can I tell an app to use 1 grid, 16 blocks, and 128 threads?

Thank you

You should study the code. That code is apparently choosing an overall number of threads it wants to run. If you divide that overall number by a smaller number of threads per block, you get a larger number of blocks. There’s no particular reason it has to be this way, of course; it depends on what you want to achieve and how you want to achieve it.

Hello and thanks for your reply,

OK, so I created a very simple CUDA program, experimented with the number of blocks and the number of threads when launching my kernel, and then used the Visual Profiler. But again, when I reduced the number of threads per block, the ‘Theoretical Active Blocks’ increased. Specifically:

hello<<<160, 512>>>(ad, bd) gives ‘Theoretical Active Blocks:4’ and ‘Theoretical Active Threads:2048’


hello<<<160, 256>>>(ad, bd) gives ‘Theoretical Active Blocks:8’ and ‘Theoretical Active Threads:2048’.


What I want is to have, for example, 4 kernels and specify how many threads each one will use. Is there some way to achieve this? Can I limit the number of active threads? Or limit the number of grids used?

The first question: why do you need that? Why not just let the kernels execute one after another?

Theoretical active blocks doesn’t tell you anything at all about what is actually running.

It is simply (in this case) 2048 divided by your block size.

But if you specify a block size of 256, you will get 256 threads per block, guaranteed.

The GPU will then run as many blocks at a time as it is capable of. This may vary by GPU as well as the specific resource utilization of your kernel code.

You generally cannot partition the GPU in CUDA, or restrict a kernel that is launched to use only a portion of the GPU resources. The GPU will attempt to execute your kernel as quickly as it can.

OK, understood, and thanks again. But if that’s so, then why does the execution time differ so much? I measured the execution time of MonteCarlo: with the initial 256 threads it took 0.53 sec, while with 1 thread it took 9.09 sec. Why is there such a big difference?

Threads are allocated on the hardware in groups of 32, i.e. one warp. Try with 32; it should take about the same 0.5 sec.

OK, thank you!