Maximum number of blocks

Hello,

I’m running a CUDA kernel, passing the usual number of blocks and threads per block as launch arguments. I’m having problems executing the kernel with a large number of blocks (>= 2000) to process a very large workload (about 200 MB).

I thought that this amount would be fine because deviceQuery indicates that the maximum number of blocks in dimension x is 2^31-1.

Is there a problem with resource allocation for larger numbers of blocks? My GPU has 8 GB of global memory. The numbers of threads per block I’m testing are 32, 64, …, 1024.
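
To give an idea of the setup, the launch looks roughly like this (a simplified sketch, not my actual kernel; the names and the per-element work are just placeholders):

```cpp
// Simplified sketch of the launch setup (placeholder kernel, not the real code).
__global__ void process(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder work per element
}

void launch(const float *d_in, float *d_out, size_t n, int threadsPerBlock)
{
    // One thread per element; the block count grows with the problem size.
    int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
    process<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
}
```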

If you need more information, please let me know.

Thanks

Hi Henrique,

The maximum number of blocks is in the billions, and using 2000 blocks with 200 MB of memory isn’t actually very large, so I doubt that’s the problem.

Can you please describe the issue you’re having and possibly provide a reproducing example?

Thanks,
Mat

Actually, the problem is the runtime I get from the application for a particular problem size and number of threads per block, specifically 32. For example:

threads per block = 32
N = 1M, T = t1
N = 2M, T = t2 > t1
N = 4M, T = t3 > t2
…
N = 32M, T = t6 < t3

where N is a multiple of the problem size M (2000 <= M <= 5000). If I use more than 32 threads per block (64, 128, …, 1024), then t6 > t5, which is expected because the problem size is larger. However, with 32 threads per block, the runtime t6 is much lower than t5 (close to t2 or t3), so I think the application is not executing properly in this case.

I thought it had to do with the allocation of resources, because I set the number of blocks equal to the problem size divided by the number of threads per block; that is, for the problem size N = 32*M, the number of blocks is M with 32 threads per block, M/2 with 64 threads per block, and so on.
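
To illustrate that mapping with made-up numbers (the value of M here is just an example from my 2000–5000 range):

```cpp
// Illustration only: how the block count scales with the block size for N = 32*M.
#include <cstdio>

int main()
{
    const size_t M = 4000;            // example problem-size factor (made up)
    const size_t N = 32 * M;          // total problem size
    for (int tpb = 32; tpb <= 1024; tpb *= 2) {
        size_t blocks = N / tpb;      // 32 -> M blocks, 64 -> M/2, ..., 1024 -> M/32
        printf("threads/block = %4d -> blocks = %zu\n", tpb, blocks);
    }
    return 0;
}
```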

What could be the reason for this difference in runtime with 32 threads per block?

What could be the reason for this difference in runtime with 32 threads per block?

Each streaming multiprocessor (SM) can run a maximum of 2048 concurrent threads (the number of SMs varies from device to device, but a V100 has 80). However, each SM can run a maximum of 16 or 32 concurrent blocks depending on the device (P100 and V100 run 32, older cards run 16). So with 32 threads per block, you can reach at most 50% utilization (32 blocks x 32 threads = 1024 of the 2048 threads).

Assuming no other limiters (such as register and shared memory usage), to reach 100% utilization you must have a minimum block size of 64.
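
If you want to verify this for your own kernel, the CUDA occupancy API will report how many blocks of a given size can be resident per SM. A minimal sketch (the kernel here is just a placeholder; substitute your own):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own to get its real occupancy.
__global__ void myKernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    for (int blockSize = 32; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
        int threadsPerSM = blocksPerSM * blockSize;
        printf("block size %4d: %2d blocks/SM, %5d of %d threads/SM (%.0f%% occupancy)\n",
               blockSize, blocksPerSM, threadsPerSM, prop.maxThreadsPerMultiProcessor,
               100.0 * threadsPerSM / prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```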

Hope this helps,
Mat

Sorry, I didn’t understand part of your answer. I’m using Pascal (20 SMs) and Volta (80 SMs) GPUs, so both support 32 concurrent blocks per SM. Suppose I set 32000 blocks with 32 threads per block, so the total number of threads is 1,024,000.

This way, 32000 blocks / 32 blocks per SM = 1000 SMs would be necessary to run all blocks concurrently. Considering all SMs are used in the Pascal GPU, 980 blocks would be stalled waiting to be processed, then 960, and so on. But what does that have to do with the 32 threads per block, and how does this affect the application? Why does it work with >= 64 threads per block (<= 992 blocks)?

The 32 threads per block case where the application does not work, which I mentioned above, is for one input dataset with N = 32M. For another input dataset, also with N = 32M, it only works with >= 256 threads per block (<= 539 blocks).

This way, 32000 blocks / 32 blocks per SM = 1000 SMs would be necessary to run all blocks concurrently. Considering all SMs are used in the Pascal GPU, 980 blocks would be stalled waiting to be processed, then 960, and so on.

I’m not quite understanding this, since you seem to switch units from SMs to blocks partway through.

Taking your example of 32,000 blocks on your Pascal GPU: you can have 32 blocks running concurrently on each SM, so with 20 SMs you’ll have 640 blocks running at the same time, and it will take 50 passes to process all 32,000 blocks.
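
To make that arithmetic explicit (assuming the 32-blocks-per-SM limit is the only constraint):

```cpp
#include <cstdio>

int main()
{
    // Back-of-the-envelope wave count, assuming blocks-per-SM is the only limiter.
    int numSMs      = 20;       // your Pascal GPU
    int blocksPerSM = 32;       // hardware limit on Pascal/Volta
    int totalBlocks = 32000;

    int concurrentBlocks = numSMs * blocksPerSM;                            // 640 blocks in flight
    int passes = (totalBlocks + concurrentBlocks - 1) / concurrentBlocks;   // 50 passes
    printf("%d blocks concurrent, %d passes\n", concurrentBlocks, passes);
    return 0;
}
```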

But what does that have to do with the 32 threads per block, and how does this affect the application?

In addition to running up to 32 blocks, each SM can run up to 2048 threads. But if you’re limiting your blocks to only 32 threads each, you’re only using half of the threads (1024) that each SM could be running.

So let’s take your example of 32,000 blocks, but think of it in terms of threads. At 32 threads per block, this means a total of 1,024,000 threads, with a maximum of 20,480 running concurrently (640 blocks x 32 threads), and 50 passes to complete. But your Pascal GPU is capable of running up to 40,960 concurrent threads (20 SMs x 2048). So to take advantage of all the available threads, you need a block size of at least 64.

Keeping the total fixed at 1,024,000 threads, setting your block size to 64 means only 16,000 blocks are needed, and only 25 passes to process them.

Keeping the number of blocks fixed at 32,000 but increasing the block size to 64 threads means you can now run 2,048,000 threads. It will still take 50 passes, but you’ve doubled the number of threads working on the problem, either reducing the time it takes to run each block or allowing you to perform more work per block.
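
Side by side as launch configurations (a schematic sketch; the kernel, pointers, and sizes are placeholders for whatever your code actually uses):

```cpp
#include <cstddef>

// Placeholder kernel; substitute your real one.
__global__ void process(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

void launch_options(const float *d_in, float *d_out, size_t n)
{
    // Option 1: keep the total at 1,024,000 threads and raise the block size to 64.
    //           Fewer blocks, so only 25 passes through the SMs.
    process<<<16000, 64>>>(d_in, d_out, n);   // 16,000 x 64 = 1,024,000 threads

    // Option 2: keep 32,000 blocks and raise the block size to 64.
    //           Still 50 passes, but twice as many threads working on the problem.
    process<<<32000, 64>>>(d_in, d_out, n);   // 32,000 x 64 = 2,048,000 threads
}
```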

Note that this is very simplified and generalized. For specific questions about your actual code, please provide a reproducing example.

-Mat