My GPU is a TESLA1060 (1.3 compute capability, 16384 registers, 30 MP, 1024threads/MP, etc)
My kernel uses 9 registers.
According to the occupancy calculator:
Active Threads per Multiprocessor 1024
Active Warps per Multiprocessor 32
Active Thread Blocks per Multiprocessor 2
Occupancy of each Multiprocessor 100%
produces:
Maximum Thread Blocks Per Multiprocessor Blocks
Limited by Max Warps / Multiprocessor 2
Limited by Registers / Multiprocessor 3
Limited by Shared Memory / Multiprocessor 8
I have 2 questions.
Firstly, and most importantly, why is it that I cannot run a kernel with configuration k<<<2*30,512>>>(…) ? By experimentation I found that the largest value for the first execution configuration parameter is 31. That is, k<<<31,512>>>(…);
Secondly, how is that ‘Max Warps/MP Limit’ calculated? I mean, all warps are of size 32 and so each warp has 32 threads, no? And the total number of threads should always be a multiple of 32, right? I’m confused.
That is strange? No matter the resources usage, you should always be able to schedule as many blocks in a grid as you want (up to 65536). In worst case scenario they will be executed in sequential order, but they should still launch correctly!
Yes, so what is the problem and source of your confusion?
Maximum number of threads/SM and blocks/SM is a function of register usage and shared memory usage. And of course the block size. But I believe you already know that?
Are you asking about this: “Limited by Max Warps / Multiprocessor 2” ?
It means that the number of blocks per multiprocesor is limited by 2, because each block (in your case) launches 16 warps (512 threads) and SM cannot handle more than 32 warps - so cannot handle more than 2 blocks of yours.
Probably an error in the code. Certainly there is no limit close to what you are seeing (each grid dimension can be 65535). Error checking should tell you what is wrong.
On your card, there is a limit of 32 active warps per multiprocessor (Appendix A of the programming guide). Each block you are launching contains 16 warps, so the scheduling limit associated with warps per multiprocessor is 32/16 = 2.
Thank you both, Cygnus and avidday, for your replies - my questions were answered perfectly! :)
Just a quick follow up question –
So, for example, if my limit, due to recourse usage, is 100 blocks per kernel launch and I execute a kernel as such:
kernel<<<1000,tpb>>>(…);
Will this execute:
100 blocks in parallel, 10 times
100 blocks in series, 10 times
and will gridDim.x range from 0-9 or 0-999?
From what I read, blocks execute in arbitrary order (which is why block-to-block communication is difficult), which suggests that #2 occurs, but that seems like a smack in the face as it’d nullify a lot of parallelism?
There should be no resource limit which will stop you from launching any number of block, from 1 up to 65535*65535. Device resource limits apply at the multiprocessor level, and may limit how many threads per block can be run for a given kernel, but the number of blocks is independent of hardware or resources.
Neither 1 nor 2. 1000 blocks will be run. How many will be scheduled to run at the same time depends on hardware. The order of execution is undefined. gridDim.x will be 1000 in every case (it is the dimension of the grid, which is constant).
Ok, but how then does one know how many blocks and threads are executing concurrently at any given instance?
How many will be scheduled to run at once – is that easy to calculate/work out?
I made a mistake, I meant to ask will blockIdx.x range from 0-9 or 0-999? Presumeably 0-999, as we needn’t worry about the scheduling and can assume the API handles this?
NVIDIA publish an occupancy calculator spreadsheet which should answer most of your questions (at least at the block level). Grid level scheduling is undefined, other than being able to say that if your block level execution parameters allow M blocks per multiprocessor to be active (active meaning scheduled and available to run), and your card has N multiprocessors, there can be up to MN active blocks, and LMN active threads (if the number of threads per block is L). The up to is an important qualifier, because it may be less than that, depending on what the scheduler does, and that is not documented or determinable beforehand. The actual execution at the MP level happens in warps of 32 threads, so at any given time there can be 32N threads actually running, with the other active threads queued for execution (or awaiting either instructions or data).