Scheduling Blocks on a Multi-Processor

All,

I am new to CUDA. The “Performance Guidelines” section of the CUDA manual says that a global memory access takes a lot of clock cycles and that it would be prudent to overlap it with the execution of another block on the multi-processor.

From the programmer’s viewpoint, how do I get to do this? When I launch my kernel I just specify the block and grid dimensions. How do I specify how many blocks have to be executed on a multi-processor?

Is it done by launching another kernel? In that case, can a multi-processor multi-task between blocks belonging to different kernels? If so, who effectuates this multi-tasking? I have one more reason to think the other block comes from a different kernel: the manual talks about sharing the shared memory among two different blocks, and says the shared memory usage of a block must be at most 1/2 of the shared memory available. That would also cover usages like one block using 1/4 of the shared memory, another block using 1/2, yet another using 1/8, and so on. Such varying sizes seem possible only if the blocks belong to different kernels.

Also, the manual talks about the “first half” and “second half” of warps. Does it mean that these warps can be scheduled out and scheduled in just once in their lifetime? Has this half-warp thing got something to do with block scheduling?

Thanks for any help.

Best Regards,
Sarnath

You should launch hundreds to thousands of blocks in a single kernel launch. The device interleaves block execution with no intervention from you. It will also interleave computation and memory access among warps in the same block. All you have to do is provide a kernel that performs more computation than memory access.
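To make that concrete, here is a minimal sketch (the kernel name, pointer names, and sizes are all illustrative, not from the thread): a single launch covering a million elements with ~4096 blocks, so the hardware always has plenty of warps whose arithmetic can hide the memory latency of others.

```cpp
#include <cuda_runtime.h>

// Illustrative kernel: one global read, some arithmetic, one global write.
// With many blocks in flight, the scheduler overlaps the memory latency of
// some warps with the arithmetic of others.
__global__ void scaleAndOffset(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];       // global read (hundreds of cycles of latency)
        x = x * 2.0f + 1.0f;     // arithmetic that other warps can overlap
        data[i] = x;             // global write
    }
}

int main()
{
    const int n = 1 << 20;                    // 1M elements
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // One launch, 4096 blocks of 256 threads: the block scheduler and the
    // per-multiprocessor warp scheduler do all the interleaving for you.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scaleAndOffset<<<blocks, threads>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```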

The device will not multitask multiple kernels, they are queued.

The max-1/2 usage of shared memory is what is needed to get two blocks running concurrently on a single multiprocessor. I.e., the sharing is with respect to the multiprocessor, not the blocks.
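As a sketch (the kernel is hypothetical, and it assumes the 16 KB of shared memory per multiprocessor described in the G80-era guide this thread discusses): a block that declares 4 KB of shared memory leaves room, by shared memory size alone, for up to four such blocks to be resident on one multiprocessor at once.

```cpp
// Hypothetical kernel using 4 KB of shared memory per block.
// Launch with blockDim.x <= 1024 so threadIdx.x stays inside the tile.
__global__ void usesSomeShared(const float *in, float *out)
{
    __shared__ float tile[1024];               // 1024 * 4 bytes = 4 KB per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                 // stage the value through shared memory
    __syncthreads();
    out[i] = tile[threadIdx.x] * 0.5f;
}
```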

First and second half-warps relate only to memory coalescing and shared memory bank conflicts. The half-warp has nothing to do with scheduling. The warp is still the smallest unit of execution on the device.
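A sketch of the access pattern coalescing cares about (illustrative kernels, not from the thread): on G80-class hardware the 16 threads of a half-warp get a single combined memory transaction when they touch 16 consecutive, properly aligned words, and fall back to separate transactions when they do not.

```cpp
// Coalesced: threads 0..15 of a half-warp read 16 adjacent floats.
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: the same half-warp's addresses are spread out, costing up to
// 16 separate memory transactions instead of one.
__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```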

Thanks for your reply. I just want to know what is so “half” about a half-warp; I just can’t comprehend it.

Also, a warp is nothing but a sub-group of threads within a block. Am I right? If there are enough computing resources available on a multiprocessor, the entire block could be scheduled on that multiprocessor. Am I right? So this would mean that all the warps of a block are executed simultaneously on the multiprocessor. Is my understanding right?

When you say “device”, do you refer to a multiprocessor or to the device as a whole? In other words, can I get multiple kernels running simultaneously on the device, provided I have enough multiprocessors (assuming that one subset of MPs executes one kernel, another subset executes another, and so on)?

Thanks for your answer.

Conceptually, a warp is just a group of threads, but it has special meaning to the hardware. Each multiprocessor has only 8 ALUs and one instruction unit. The instruction unit runs at 1/4 of the ALU clock, so a single instruction is applied to 32 threads (8 ALUs over 4 ALU clocks) before the multiprocessor moves on to the next instruction. So you see, 32 threads really is the smallest unit of execution on the device.
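A small sketch of how a block decomposes into warps of 32 consecutive threads (the kernel and output arrays are hypothetical; warpSize is the built-in device constant, 32 on this hardware):

```cpp
// Each thread records which warp of its block it belongs to and its
// position ("lane") within that warp.
__global__ void recordWarpAndLane(int *warpIdOut, int *laneOut)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warpIdOut[tid] = threadIdx.x / warpSize;   // which warp of the block
    laneOut[tid]   = threadIdx.x % warpSize;   // position within that warp
}

// Launched as recordWarpAndLane<<<1, 128>>>(...): threads 0..31 report
// warp 0, threads 32..63 report warp 1, and so on.
```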

An entire block is ALWAYS placed entirely on one multiprocessor (for access to that processor’s shared memory, texture cache, etc.). All the warps of that block are run in an interleaved fashion. So conceptually, as you program, you imagine that all threads are being executed simultaneously, but in reality the multiprocessor is interleaving all the resident warps, and the interleaving happens at the instruction level. That is, after warp 1 executes instruction 1, the multiprocessor may go on to warp 15 and execute the current instruction there, then go on to warp 100 (possibly belonging to another resident block) and execute an instruction there, and so on. All the while, any warp “waiting” on a global memory read is not scheduled until that read has finished.

It’s not spelled out exactly in the guide, but IIRC it is something like this: the clock rate of the memory unit is half that of the ALUs. Since the ALUs process 8 threads at once, the memory unit kicks in after every 16 threads (one half-warp) have had their instructions executed.

I mean the entire GPU: all 16 multiprocessors plus the block scheduler. It is not possible to run multiple kernels on different multiprocessors. This feature has been requested many times, and judging by their posts, NVIDIA reps have seriously considered it. However, it would require some major changes to the block/grid computation model, so I don’t think it likely that they will implement it.

If you really need this functionality and can live with the same block dimensions among all the kernels, you can branch with an if on blockIdx.x to implement multiple “different” kernels in one launch, as in the sketch below.
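A sketch of that workaround (the kernel name, arrays, and bodies are illustrative): one launch in which the first blocksForA blocks do “task A” on one array and the remaining blocks do “task B” on another, both with identical block dimensions.

```cpp
__global__ void fusedKernels(float *a, float *b, int blocksForA)
{
    if (blockIdx.x < blocksForA) {
        // "Kernel" A: handled by the first blocksForA blocks.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] *= 2.0f;
    } else {
        // "Kernel" B: handled by the remaining blocks, renumbered from zero.
        int i = (blockIdx.x - blocksForA) * blockDim.x + threadIdx.x;
        b[i] += 1.0f;
    }
}

// Host side:
// fusedKernels<<<blocksForA + blocksForB, 256>>>(d_a, d_b, blocksForA);
```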

Mr.Anderson,
Thanks for your detailed reply

Figure 2-1 on page 9 of the NVIDIA CUDA Programming Guide (page 21 of 125 in the PDF) shows the card as if it executes multiple kernels at the same time. Please read the line below the figure; it really does suggest that the kernels are being executed simultaneously…

So, is this a bug in their manual?

Can an NVIDIA representative clarify it?

This is not a bug.
Only one kernel can be executed on the device at a time.
The arrow on the left indicates the time scale.
The host invokes Kernel 1 and it gets executed on the device.
Then Kernel 2 is invoked and executed, but they are not executed in parallel, since Kernel 1 has already finished when Kernel 2 starts.

Nope. Since kernel launch is an asynchronous event, the second kernel is also launched immediately.

In any case, it would be nice if nVIDIA cleaned up that diagram and stated clearly what can be expected and what cannot.

Nope, kernel launch is asynchronous, but launching a second kernel blocks until the first is finished.

I had no trouble at all understanding the diagram; it is merely there to show that you can run different-sized kernels after each other. The text under the figure is also quite clear.

VanDammage is correct.

Async launch and async execution are different things. When you launch a kernel you get control back in your program immediately, but the kernel is actually queued somewhere in the driver. The next kernel gets executed on the device only when the previous one has finished. Only one kernel can be executed on the device at a given time; the others wait in the queue.
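A sketch of that host-side timeline (kernelA, kernelB, doUnrelatedHostWork, and the launch parameters are placeholders assumed to exist elsewhere): the launch calls return right away, yet on the device kernelB starts only after kernelA finishes.

```cpp
kernelA<<<gridA, block>>>(d_data);   // queued; control returns to the host at once
kernelB<<<gridB, block>>>(d_data);   // queued in the driver behind kernelA

doUnrelatedHostWork();               // host code here overlaps with device execution

cudaThreadSynchronize();             // block the host until the whole queue drains
                                     // (called cudaDeviceSynchronize in later CUDA versions)
```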

IMO Fig 2-1 in the Programming Manual is okay.

I don’t know what’s so hard to understand about the fact that only one kernel can be executed on the device at a time. :)

Aah. It’s just my fantasy for novel models of computation. That’s what makes me look for more and more in everything.

What if… NVIDIA actually designed it that way and documented it wrongly… Aaa… You know I am kidding… :-)

Well, I would like to quote “Only the Paranoid Survive”.