Max gridDim.x ?

zeus13i · March 10, 2010, 4:33am

My GPU is a Tesla 1060 (compute capability 1.3, 16384 registers per MP, 30 MPs, 1024 threads/MP, etc.)
My kernel uses 9 registers.

According to the occupancy calculator:
Active Threads per Multiprocessor: 1024
Active Warps per Multiprocessor: 32
Active Thread Blocks per Multiprocessor: 2
Occupancy of each Multiprocessor: 100%

produces:

Maximum Thread Blocks Per Multiprocessor (Blocks)
Limited by Max Warps / Multiprocessor: 2
Limited by Registers / Multiprocessor: 3
Limited by Shared Memory / Multiprocessor: 8

I have 2 questions.

Firstly, and most importantly, why is it that I cannot run a kernel with configuration k<<<2*30,512>>>(…) ? By experimentation I found that the largest value for the first execution configuration parameter is 31. That is, k<<<31,512>>>(…);

Secondly, how is that ‘Max Warps/MP Limit’ calculated? I mean, all warps are of size 32 and so each warp has 32 threads, no? And the total number of threads should always be a multiple of 32, right? I’m confused.

Cygnus_X1 · March 10, 2010, 7:17am

That is strange. No matter the resource usage, you should always be able to schedule as many blocks in a grid as you want (up to 65535 per grid dimension). In the worst case they will be executed sequentially, but they should still launch correctly!

Yes, so what is the problem and source of your confusion?

The maximum number of threads/SM and blocks/SM is a function of register usage, shared memory usage, and of course the block size. But I believe you already know that?

Are you asking about this: “Limited by Max Warps / Multiprocessor 2” ?

It means that the number of blocks per multiprocessor is limited to 2, because each block (in your case) contains 16 warps (512 threads) and an SM cannot handle more than 32 warps, so it cannot hold more than 2 of your blocks.

avidday · March 10, 2010, 7:20am

Probably an error in the code. Certainly there is no limit close to what you are seeing (each grid dimension can be 65535). Error checking should tell you what is wrong.
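As a minimal sketch of the error checking suggested here: the kernel body below is a hypothetical stand-in (the poster's kernel `k` is not shown in the thread), and `checkCuda` and `launchChecked` are illustrative helper names, not part of the CUDA API. It uses `cudaDeviceSynchronize`, the current name of the call that toolkits from the time of this thread exposed as `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel `k`; the real body is unknown.
__global__ void k(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

// Returns true on cudaSuccess; otherwise prints the runtime's error message.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Launches k<<<blocks, threads>>> and reports any error along the way.
bool launchChecked(int blocks, int threads)
{
    int *d_out = nullptr;
    size_t bytes = size_t(blocks) * threads * sizeof(int);
    if (!checkCuda(cudaMalloc(&d_out, bytes), "cudaMalloc")) return false;

    k<<<blocks, threads>>>(d_out);

    // Invalid launch configurations are reported by cudaGetLastError().
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs surface at the next synchronizing call.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_out);
    return ok;
}
```

Calling `launchChecked(2 * 30, 512)` would then print the reason for a failing launch instead of failing silently.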

On your card, there is a limit of 32 active warps per multiprocessor (Appendix A of the programming guide). Each block you are launching contains 16 warps, so the scheduling limit associated with warps per multiprocessor is 32/16 = 2.
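The arithmetic behind the calculator's numbers can be sketched in host code. This is a simplification under stated assumptions: the constants are the compute capability 1.3 figures quoted in this thread, register allocation granularity is ignored, and shared memory use is assumed not to be the limiting factor. The function names are illustrative only.

```cuda
#include <algorithm>

// Per-multiprocessor limits for compute capability 1.3, as quoted in the thread.
const int kWarpSize       = 32;
const int kMaxWarpsPerMP  = 32;
const int kRegistersPerMP = 16384;
const int kMaxBlocksPerMP = 8;   // hardware cap on resident blocks per MP

// Number of warps needed for one block (a partial warp still occupies a full one).
int warpsPerBlock(int threadsPerBlock)
{
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Blocks per MP allowed by the warp limit: 32 / 16 = 2 for 512-thread blocks.
int blocksLimitedByWarps(int threadsPerBlock)
{
    return kMaxWarpsPerMP / warpsPerBlock(threadsPerBlock);
}

// Blocks per MP allowed by registers (assumption: no allocation granularity).
int blocksLimitedByRegisters(int threadsPerBlock, int regsPerThread)
{
    return kRegistersPerMP / (threadsPerBlock * regsPerThread);
}

// The active block count is the smallest of the individual limits.
int activeBlocksPerMP(int threadsPerBlock, int regsPerThread)
{
    int byWarps = blocksLimitedByWarps(threadsPerBlock);
    int byRegs  = blocksLimitedByRegisters(threadsPerBlock, regsPerThread);
    return std::min(kMaxBlocksPerMP, std::min(byWarps, byRegs));
}
```

For 512 threads per block and 9 registers per thread this gives limits of 2 (warps) and 3 (registers), so 2 active blocks per multiprocessor, matching the calculator output quoted above.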

zeus13i · March 10, 2010, 11:23pm

Thank you both, Cygnus and avidday, for your replies - my questions were answered perfectly! :)

Just a quick follow up question –

So, for example, if my limit, due to resource usage, is 100 blocks per kernel launch and I execute a kernel as such:

kernel<<<1000,tpb>>>(…);

Will this execute:

1. 100 blocks in parallel, 10 times
2. 100 blocks in series, 10 times

and will gridDim.x range from 0-9 or 0-999?

From what I read, blocks execute in arbitrary order (which is why block-to-block communication is difficult), which suggests that #2 occurs, but that seems like a smack in the face as it’d nullify a lot of parallelism?

avidday · March 10, 2010, 11:33pm

There should be no resource limit which will stop you from launching any number of blocks, from 1 up to 65535*65535. Device resource limits apply at the multiprocessor level, and may limit how many threads per block can be run for a given kernel, but the number of blocks is independent of hardware or resources.

Neither 1 nor 2. 1000 blocks will be run. How many will be scheduled to run at the same time depends on hardware. The order of execution is undefined. gridDim.x will be 1000 in every case (it is the dimension of the grid, which is constant).
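A small probe can make the distinction between the grid dimension and the per-block index concrete. This is only a sketch: `recordIndices` and `probeGrid` are hypothetical names introduced for illustration, and the block size passed in is arbitrary.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Thread 0 of every block records its own block index and the grid size it sees.
__global__ void recordIndices(unsigned *blockIds, unsigned *gridDims)
{
    if (threadIdx.x == 0) {
        blockIds[blockIdx.x] = blockIdx.x;  // differs per block: 0 .. gridDim.x - 1
        gridDims[blockIdx.x] = gridDim.x;   // identical in every block
    }
}

// Launches the probe and copies both arrays back to the host.
cudaError_t probeGrid(int blocks, int threadsPerBlock,
                      std::vector<unsigned> &ids, std::vector<unsigned> &dims)
{
    unsigned *dIds = nullptr, *dDims = nullptr;
    size_t bytes = blocks * sizeof(unsigned);

    cudaError_t err = cudaMalloc(&dIds, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dDims, bytes);
    if (err != cudaSuccess) { cudaFree(dIds); return err; }

    recordIndices<<<blocks, threadsPerBlock>>>(dIds, dDims);
    err = cudaGetLastError();

    ids.resize(blocks);
    dims.resize(blocks);
    // cudaMemcpy waits for the kernel to finish before copying.
    if (err == cudaSuccess)
        err = cudaMemcpy(ids.data(), dIds, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(dims.data(), dDims, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIds);
    cudaFree(dDims);
    return err;
}
```

After a successful `probeGrid(1000, 64, ids, dims)`, every entry of `dims` holds 1000, while `ids` holds 0 through 999 regardless of the order in which the blocks were actually scheduled.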

zeus13i · March 10, 2010, 11:41pm

Thanks for the swift response.

Ok, but how then does one know how many blocks and threads are executing concurrently at any given instance?

How many will be scheduled to run at once – is that easy to calculate/work out?

I made a mistake: I meant to ask whether blockIdx.x will range from 0-9 or 0-999. Presumably 0-999, as we needn’t worry about the scheduling and can assume the API handles this?

avidday · March 11, 2010, 12:01am

NVIDIA publishes an occupancy calculator spreadsheet which should answer most of your questions (at least at the block level). Grid level scheduling is undefined, other than being able to say that if your block level execution parameters allow M blocks per multiprocessor to be active (active meaning scheduled and available to run), and your card has N multiprocessors, there can be up to M*N active blocks, and L*M*N active threads (if the number of threads per block is L). The "up to" is an important qualifier, because it may be less than that, depending on what the scheduler does, and that is not documented or determinable beforehand. The actual execution at the MP level happens in warps of 32 threads, so at any given time there can be 32*N threads actually running, with the other active threads queued for execution (or awaiting either instructions or data).
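The upper bounds described here can be sketched as follows. The pure helpers just encode the M*N and L*M*N products. `queryUpperBounds` is a hypothetical helper that obtains N from `cudaGetDeviceProperties` and M from `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, a runtime call from toolkits released after this thread; at the time, M would have been read off the spreadsheet. `sampleKernel` is a placeholder, not the poster's kernel.

```cuda
#include <cuda_runtime.h>

// Upper bound on active blocks: M blocks per multiprocessor times N multiprocessors.
int maxActiveBlocks(int M, int N) { return M * N; }

// Upper bound on active threads: L threads per block times M*N active blocks.
int maxActiveThreads(int L, int M, int N) { return L * M * N; }

// Placeholder kernel used only so the occupancy query has something to inspect.
__global__ void sampleKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Computes the upper bounds for device 0 with L threads per block.
// These are bounds only; the scheduler may keep fewer blocks active.
cudaError_t queryUpperBounds(int L, int *activeBlocks, int *activeThreads)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    int M = 0;  // active blocks per multiprocessor for this kernel and block size
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&M, sampleKernel, L, 0);
    if (err != cudaSuccess) return err;

    int N = prop.multiProcessorCount;
    *activeBlocks  = maxActiveBlocks(M, N);
    *activeThreads = maxActiveThreads(L, M, N);
    return cudaSuccess;
}
```

With the numbers from this thread (L = 512, M = 2, N = 30), the bounds come out to 60 active blocks and 30720 active threads.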

zeus13i · March 11, 2010, 12:25am

Thank you so much!
