I am trying to design a distributed application that needs a lot of computing power, and to achieve that I need to keep the GPU as occupied as possible. My question is the following:
Suppose I have 128 cores in my GPU. At a certain point I launch 32 thread blocks, each containing X threads. Can I launch another 32 blocks before the first 32 blocks finish their processing? What other difficulties should I expect if this behaviour is possible?
You can queue them to launch, but current hardware only allows one active kernel launch at a time. The new Fermi architecture should support running up to 4 kernels simultaneously on different portions of the hardware, but even then there is no overlapping of successive kernel launches within the same stream.
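To illustrate the queuing behaviour, here is a minimal sketch (my own example, not from the original posts) that issues two kernel launches into separate CUDA streams. Both launch calls return immediately and the driver queues the work; on current hardware the second kernel simply waits for the first, while on Fermi-class hardware the two kernels may execute concurrently. The kernel `busyKernel` and the problem size are placeholders:

```cuda
#include <cstdio>

// Trivial kernel that just keeps the GPU busy for a while.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches are asynchronous: the calls return at once and
    // the driver queues the kernels. Pre-Fermi hardware runs them
    // one after the other; Fermi may overlap them.
    busyKernel<<<32, 128, 0, s1>>>(d_a, n);
    busyKernel<<<32, 128, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();   // block the host until both finish
    printf("both kernels done\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```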
Thanks a lot for the answer!
Hmmm … nasty stuff !
So in this case I should make sure that, in one kernel launch, I process as much data as I can. Right?
Not necessarily. Kernel launch overhead is rather small (something of the order of 10-20 microseconds, at least on the platforms I use), so the "economics" of many or few kernel launches and much or little work per kernel really comes down to the kernel runtime versus the launch overhead. If you are planning on using a shared GPU (i.e. one with a display attached), many shorter kernel launches are usually preferable to one large one because of display responsiveness and driver watchdog timer issues.
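If you want to see where the break-even point lies on your own hardware, a simple way is to time repeated launches of an empty kernel with CUDA events. This is a rough sketch of my own (not from the original posts); the warm-up launch and the launch count are arbitrary choices:

```cuda
#include <cstdio>

__global__ void emptyKernel() {}   // does no work; isolates launch cost

int main()
{
    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();       // warm-up: the very first launch is slower
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.1f microseconds\n",
           ms * 1000.0f / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the per-launch overhead you measure is tiny compared to your real kernel's runtime, splitting the work into several launches costs you almost nothing.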