I changed the default number of threads per block in the CUDA sample ‘MonteCarlo’ and then used the Visual Profiler to see what happens. Changing the default of 256 threads per block to 128 increases ‘Active Blocks’ from 8 to 16, and reducing it further from 128 to 64 increases ‘Active Blocks’ to 32, which is the Device Limit.
Why does this happen, and why does the ‘Active Threads’ number stay constant despite these changes (Active Threads: Theoretical -> 2048, Device Limit -> 2048)?
When I reduced the number of threads per block to 1, the ‘Block Limit’ became 64, which exceeds the ‘Device Limit’ of 32, and this time ‘Active Threads’ fell to 1024.
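As far as I can tell, the figures above are at least consistent with simple per-multiprocessor arithmetic in which threads are counted in whole warps. Here is a minimal host-only sketch of that arithmetic. The limits (2048 threads and 32 blocks per multiprocessor) are taken from the profiler output I quoted, the warp size of 32 is my assumption, and `theoreticalOccupancy` is a name I made up for illustration. It ignores register and shared-memory limits, so it is not how the profiler actually computes its values.

```cuda
// Host-only arithmetic; no GPU is needed to compile or run it.
#include <algorithm>

// Per-multiprocessor limits, taken from the profiler figures in this post.
constexpr int kMaxThreadsPerSM = 2048;  // 'Active Threads' device limit
constexpr int kMaxBlocksPerSM  = 32;    // 'Active Blocks' device limit
constexpr int kWarpSize        = 32;    // assumption: 32 threads per warp

struct Occupancy {
    int warpsPerBlock;        // block size rounded up to whole warps
    int blockLimitFromWarps;  // blocks that fit if only warps were limiting
    int activeBlocks;         // after also applying the per-SM block limit
    int activeThreads;        // thread slots occupied, counted in whole warps
};

// Hypothetical helper: theoretical occupancy for a given block size,
// considering only the warp limit and the block limit.
Occupancy theoreticalOccupancy(int threadsPerBlock) {
    const int maxWarpsPerSM = kMaxThreadsPerSM / kWarpSize;  // 64 warps
    const int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    const int blockLimitFromWarps = maxWarpsPerSM / warpsPerBlock;
    const int activeBlocks = std::min(blockLimitFromWarps, kMaxBlocksPerSM);
    const int activeThreads = activeBlocks * warpsPerBlock * kWarpSize;
    return {warpsPerBlock, blockLimitFromWarps, activeBlocks, activeThreads};
}
```

Under these assumptions, block sizes of 256, 128 and 64 give 8, 16 and 32 active blocks with 2048 active threads each time, while a block size of 1 gives a warp-based block limit of 64, capped to 32 active blocks, and 32 x 32 = 1024 active threads.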
Can someone help me understand why this happens? Also, what if I want to reduce the total number of threads used and impose some limits on my application? For example, can I tell an app to use 1 grid, 16 blocks, 128 threads?
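To make the last question concrete, this is roughly the kind of launch I have in mind. It is only a sketch: `scaleKernel` is a hypothetical stand-in for the sample's real kernel, the array size is arbitrary, and I am assuming that "128 threads" means 128 threads per block. A kernel launch is always a single grid, so the `<<<16, 128>>>` configuration fixes the grid at 16 blocks of 128 threads, and a grid-stride loop lets that fixed number of threads cover all the elements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, standing in for the sample's real kernel.
__global__ void scaleKernel(float *data, int n, float factor) {
    // Grid-stride loop: a fixed-size grid still visits all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;  // arbitrary element count for illustration
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    const dim3 grid(16);    // 16 blocks in the (single) grid
    const dim3 block(128);  // 128 threads per block
    scaleKernel<<<grid, block>>>(d_data, n, 2.0f);

    // Check for launch-configuration errors, then for execution errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

This caps the total number of threads in the launch at 16 x 128 = 2048. It does not by itself control how the hardware distributes those blocks across multiprocessors.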