Max number of blocks

Hi

Is there a limit on the number of blocks that can be launched with <<<…>>>?

My code stops working when launching 640 blocks of 16 threads each; the Visual Profiler shows that no threads were spawned. The code works fine with 320 such blocks.

Thanks.

There is a limit, but it is 65535 in each direction of the grid (see Appendix A of the programming guide for future reference), so you clearly haven’t reached any sort of grid-size limit. There is, however, a limit of 512 threads per block, so I am going to guess you have the block and thread dimensions reversed in your kernel launch call. The correct order is

kernel<<<gridsize, blocksize, sharedmemory, streamid>>>(args)
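
For reference, here is a minimal sketch of a correct launch with the sizes from the original post (the kernel and variable names are placeholders, not taken from the original code):

__global__ void kernel(float *data)
{
    /* ... */
}

int main(void)
{
    dim3 gridsize(640);   // number of blocks in the grid
    dim3 blocksize(16);   // threads per block (must be <= 512 on compute capability 1.x)
    float *d_data = 0;
    cudaMalloc((void **)&d_data, 640 * 16 * sizeof(float));
    kernel<<<gridsize, blocksize>>>(d_data);  // shared memory and stream arguments are optional
    cudaFree(d_data);
    return 0;
}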

Hi, I am struggling with more or less the same issue: given the device properties, the number of registers required per thread, and the number of bytes of shared memory required per thread, what is the maximum number of threads I can put in one kernel launch?

I was using the information from Appendix G of the CUDA Programming Guide 3.0 and the device properties returned by cudaGetDeviceProperties. Also, in a CUDA presentation I found:
“Registers are partitioned among all resident threads.”
“Shared memory is partitioned among all resident blocks.”

Limitations:

  • max number of blocks in a grid: 65535 * 65535
  • max number of threads in a block: 512 (1024 for compute capability 2.0)
  • max number of resident blocks per SM: ?
  • max number of resident warps per SM: ?
  • max number of resident threads per SM: ?
  • max number of registers per SM: ?
  • limited by use of shared memory: ?

What does “resident” mean? Are all threads/blocks in a kernel launch resident?

Thanks.

“Resident” means the threads/blocks that are on a multiprocessor at one time. If you have more blocks than can be resident at once on all multiprocessors, the excess blocks are launched as the first ones finish.

  • max number of resident blocks per SM: 8

  • max number of resident warps per SM: 24 (1.0, 1.1), 32 (1.2, 1.3), 48 (2.0)

  • max number of resident threads per SM: see above * 32

  • max number of registers per SM: 8k (1.0, 1.1), 16k (1.2, 1.3), 32k (2.0)

In general, the maximum number of threads per block you can launch your kernel with depends on the number of registers per thread. If the amount of shared memory per block depends on the number of threads you launch, then shared memory also plays a role.

After that, the number of threads, the number of registers, and the amount of shared memory used per block limit the number of concurrent blocks on a multiprocessor.
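
A rough sketch of that arithmetic in plain C, using the compute capability 1.3 limits listed above (it ignores register-allocation granularity, so real hardware may fit fewer blocks than this estimate):

#include <stdio.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Estimate how many blocks can be resident on one SM, given threads per
   block, registers per thread, and shared memory per block. */
int resident_blocks_per_sm(int threadsPerBlock, int regsPerThread, int smemPerBlock)
{
    const int maxBlocksPerSM = 8;      /* resident-block limit       */
    const int maxWarpsPerSM  = 32;     /* CC 1.2/1.3                 */
    const int regsPerSM      = 16384;  /* 16k registers, CC 1.2/1.3  */
    const int smemPerSM      = 16384;  /* 16 KB shared memory        */
    const int warpSize       = 32;

    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    int byWarps = maxWarpsPerSM / warpsPerBlock;
    int byRegs  = regsPerSM / (regsPerThread * threadsPerBlock);
    int bySmem  = smemPerBlock > 0 ? smemPerSM / smemPerBlock : maxBlocksPerSM;
    return min2(min2(byWarps, byRegs), min2(bySmem, maxBlocksPerSM));
}

int main(void)
{
    /* e.g. 256 threads/block, 16 registers/thread, 4 KB smem/block -> 4 blocks */
    printf("%d blocks resident per SM\n", resident_blocks_per_sm(256, 16, 4096));
    return 0;
}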

Is there a way to know these numbers of total resident threads using cudaprof or any other tool?

The occupancy spreadsheet can give you this if you know the resource usage of your kernel code (itself obtainable from nvcc, e.g. by compiling with --ptxas-options=-v).

If I have this: cta_launched = 1444

(cta_launched: number of thread blocks executed)

with:

Grid = 114x114

256 threads per block

N = 12996

how does that relate to the number of simultaneous threads on a GTX 260?

It doesn’t. If you want that, you need the occupancy number for the kernel, either from the profiler or from the occupancy spreadsheet I gave you the link for. You can then calculate the number of active warps per MP, and multiply that by the warp size to get the number of active threads.

I just did a test to find the optimal number of threads per block for a kernel. I used the profiler to run many sessions, changing the number of threads per block until the optimal number was achieved. At the end of each run I did an occupancy check, which gives a summary of what is going on and, most importantly, whether the limit is due to the number of registers or the number of blocks. I changed the number of threads per block accordingly, recompiled, and ran again.
This is a bit tedious but did improve my program’s speed.

My question is: does the compiler establish the number of registers and shared memory at compile time, or does it change during the run? If it is established at compile time, why not calculate the optimum number based on these two critical values in the program itself? I will read chapter 4.2 of the programming guide again to understand what is going on.

The compiler cannot determine the optimum number of threads. A high occupancy doesn’t imply better performance. You need a certain number of threads to hide the latencies of the processor pipeline (192 on G80/G200 cards; for Fermi that number is higher). Beyond that, a higher occupancy only helps to hide memory access latency, so if that is not a bottleneck you gain nothing. In addition, your algorithm may be less efficient with a higher number of threads (or more efficient, but I think that is rare). So it really is problem-specific. Occupancy can serve as a rough guideline, but in the end there is no way around experimenting with the parameters.
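
As for the compile-time part of the question: the register count and the static shared memory size are fixed by the compiler (dynamic shared memory is whatever you pass as the third <<<…>>> argument), so a program can query them at run time. A minimal sketch, assuming the runtime API’s cudaFuncGetAttributes is available and using a placeholder kernel name:

#include <stdio.h>

__global__ void mykernel(float *x)
{
    if (x) x[threadIdx.x] = 0.0f;
}

int main(void)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mykernel);  /* query the compiled kernel */
    printf("registers per thread  : %d\n", attr.numRegs);
    printf("static smem per block : %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
    printf("max threads per block : %d\n", attr.maxThreadsPerBlock);
    return 0;
}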

Thanks for the answers,

I understand now how it works. I did some testing to find the boundaries and they confirm it.

But I could not test the boundary of the register usage. I could not increase the actual register usage (maybe the compiler optimizes code or allocates local variables in local memory instead of registers?).

Sometimes I see the term “active threads/blocks”; does “active” have the same meaning as “resident”?

Are you saying that if I get 0.418 as occupancy on a GTX 260,

N = 0.418 * 32 (warp size) * 27 (MPs) = 361 parallel threads only?

No. Occupancy is defined as the ratio of the number of active warps per multiprocessor to the maximum number of active warps per multiprocessor.
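
To make that concrete with the numbers above (a rough worked example; it uses the 27-MP figure you quoted, and compute capability 1.3 allows at most 32 resident warps per MP):

active warps per MP   ≈ 0.418 * 32 ≈ 13.4
active threads per MP ≈ 13.4 * 32 (warp size) ≈ 428
total active threads  ≈ 428 * 27 (MPs) ≈ 11,556

So you multiply the occupancy by the maximum warps per MP first, and then by the warp size; not by the warp size alone.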