Max number of blocks

Hi

Is there a limit on the number of blocks that can be launched with <<<…>>>?

My code stops working when launching 640 blocks of 16 threads each; the Visual Profiler shows that no threads were spawned. The code works fine with 320 such blocks.

Thanks.

There is a limit, but it is 65535 in each direction of the grid (see Appendix A of the programming guide for future reference), so you clearly haven’t reached any sort of grid-size limit. There is, however, a limit of 512 threads per block, so I am going to guess you have the block and thread dimensions reversed in your kernel launch call. The correct order is

kernel<<<gridsize, blocksize, sharedmemory, streamid>>>(args)
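
For reference, here is a minimal sketch of a correct launch with the sizes from the original post (the kernel and variable names are placeholders, not taken from the original code):

__global__ void kernel(float *data)
{
    /* ... */
}

int main(void)
{
    dim3 gridsize(640);   // number of blocks in the grid
    dim3 blocksize(16);   // threads per block (must be <= 512 on compute capability 1.x)
    float *d_data = 0;
    cudaMalloc((void **)&d_data, 640 * 16 * sizeof(float));
    kernel<<<gridsize, blocksize>>>(d_data);  // shared memory and stream arguments are optional
    cudaFree(d_data);
    return 0;
}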

Hi, I am struggling with more or less the same issue: given the device properties, the number of registers required per thread, and the number of bytes of shared memory required per thread, what is the maximum number of threads I can put in one kernel launch?

I was using the information from Appendix G of the CUDA Programming Guide 3.0 and the device properties returned by cudaGetDeviceProperties. Also, in a CUDA presentation I found:
“Registers are partitioned among all resident threads.”
“Shared memory is partitioned among all resident blocks.”

Limitations:

  • max number of blocks in a grid: 65535 * 65535
  • max number of threads in a block: 512 (1024 for compute capability 2.0)
  • max number of resident blocks per SM: ?
  • max number of resident warps per SM: ?
  • max number of resident threads per SM: ?
  • max number of registers per SM: ?
  • limited by use of shared memory: ?

What does “resident” mean? Are all threads/blocks in a kernel launch resident?

Thanks.

“Resident” means the threads/blocks that are on a multiprocessor at one time. If you have more blocks than can be resident at once on all multiprocessors, the excess blocks are launched as the first ones finish.

  • max number of resident blocks per SM: 8

  • max number of resident warps per SM: 24 (1.0, 1.1), 32 (1.2, 1.3), 48 (2.0)

  • max number of resident threads per SM: see above * 32

  • max number of registers per SM: 8k (1.0, 1.1), 16k (1.2, 1.3), 32k (2.0)

In general, the maximum number of threads per block you can launch your kernel with depends on the number of registers per thread. If the amount of shared memory per block depends on the number of threads you launch, then shared memory also plays a role.

After that, the number of threads, the number of registers, and the amount of shared memory used per block limit the number of concurrent blocks on a multiprocessor.
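
A rough sketch of that arithmetic in plain C, using the compute capability 1.3 limits listed above (it ignores register-allocation granularity, so real hardware may fit fewer blocks than this estimate):

#include <stdio.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Estimate how many blocks can be resident on one SM, given threads per
   block, registers per thread, and shared memory per block. */
int resident_blocks_per_sm(int threadsPerBlock, int regsPerThread, int smemPerBlock)
{
    const int maxBlocksPerSM = 8;      /* resident-block limit       */
    const int maxWarpsPerSM  = 32;     /* CC 1.2/1.3                 */
    const int regsPerSM      = 16384;  /* 16k registers, CC 1.2/1.3  */
    const int smemPerSM      = 16384;  /* 16 KB shared memory        */
    const int warpSize       = 32;

    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    int byWarps = maxWarpsPerSM / warpsPerBlock;
    int byRegs  = regsPerSM / (regsPerThread * threadsPerBlock);
    int bySmem  = smemPerBlock > 0 ? smemPerSM / smemPerBlock : maxBlocksPerSM;
    return min2(min2(byWarps, byRegs), min2(bySmem, maxBlocksPerSM));
}

int main(void)
{
    /* e.g. 256 threads/block, 16 registers/thread, 4 KB smem/block -> 4 blocks */
    printf("%d blocks resident per SM\n", resident_blocks_per_sm(256, 16, 4096));
    return 0;
}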

Is there a way to know these numbers of total resident threads using cudaprof or any other tool?

The occupancy spreadsheet can give you this if you know the resource usage of your kernel code (itself obtainable from nvcc, e.g. by compiling with --ptxas-options=-v).

If I have this: cta_launched = 1444

(cta_launched: number of thread blocks executed)

with:

Grid = 114x114

256 threads per block

N = 12996

how does that relate to the number of simultaneous threads on a GTX 260?

It doesn’t. If you want that, you need the occupancy number for the kernel, either from the profiler or from the occupancy spreadsheet I gave you the link for. You can then calculate the number of active warps per MP, and multiply that by the warp size to get the number of active threads.

I just did a test to find the optimal number of threads per block for a kernel. I used the profiler to run many sessions, changing the number of threads per block until the optimal number was achieved. At the end of each run I did an occupancy check, which gives a summary of what is going on and, most importantly, whether the limit is due to the number of registers or the number of blocks. I changed the number of threads per block accordingly, recompiled, and ran again.
This is a bit tedious but did improve my program’s speed.

My question is: does the compiler establish the number of registers and shared memory at compile time, or does it change during the run? If it is established at compile time, why not calculate the optimum number based on these two critical values in the program itself? I will read chapter 4.2 of the programming guide again to understand what is going on.

The compiler cannot determine the optimum number of threads. A high occupancy doesn’t imply better performance. You need a certain number of threads to hide the latencies of the processor pipeline (192 on G80/G200 cards; for Fermi that number is higher). Beyond that, a higher occupancy only helps to hide memory access latency, so if that is not a bottleneck you gain nothing. In addition, your algorithm may be less efficient with a higher number of threads (or more efficient, but I think that is rare). So it really is problem-specific. Occupancy can serve as a rough guideline, but in the end there is no way around experimenting with the parameters.
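
As for the compile-time part of the question: the register count and the static shared memory size are fixed by the compiler (dynamic shared memory is whatever you pass as the third <<<…>>> argument), so a program can query them at run time. A minimal sketch, assuming the runtime API’s cudaFuncGetAttributes is available and using a placeholder kernel name:

#include <stdio.h>

__global__ void mykernel(float *x)
{
    if (x) x[threadIdx.x] = 0.0f;
}

int main(void)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mykernel);  /* query the compiled kernel */
    printf("registers per thread  : %d\n", attr.numRegs);
    printf("static smem per block : %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
    printf("max threads per block : %d\n", attr.maxThreadsPerBlock);
    return 0;
}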

Thanks for the answers,

I understand now how it works. I did some testing to find the boundaries and they confirm it.

But I could not test the boundary of the register usage. I could not increase the actual register usage (maybe the compiler optimizes code or allocates local variables in local memory instead of registers?).

Sometimes I see the term “active threads/blocks”; does “active” have the same meaning as “resident”?

Are you saying that if I get 0.418 as occupancy on a GTX 260,

N = 0.418 * 32 (warp size) * 27 (MPs) = 361 parallel threads only?

No. Occupancy is defined as the ratio of the number of active warps per multiprocessor to the maximum number of active warps per multiprocessor.
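
To make that concrete with the numbers above (a rough worked example; it uses the 27-MP figure you quoted, and compute capability 1.3 allows at most 32 resident warps per MP):

active warps per MP   ≈ 0.418 * 32 ≈ 13.4
active threads per MP ≈ 13.4 * 32 (warp size) ≈ 428
total active threads  ≈ 428 * 27 (MPs) ≈ 11,556

So you multiply the occupancy by the maximum warps per MP first, and then by the warp size; not by the warp size alone.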