Simultaneous kernel executions not possible? Disappointing news for me

According to the CUDA FAQ (here on the forum), multiple kernels cannot be executed on a single device at the same time. This limitation seems to narrow the usability of the framework, since some classes of tasks cannot be effectively split into, say, 16*768=12288 or more threads. It would be ideal to solve such tasks in 128 threads each, starting these tasks independently while keeping the device fully loaded.

Actually, when starting with CUDA I expected there to be a kind of scheduler allowing independent start and finish of diverse tasks. But my experiments quickly showed that kernel calls are serialized, even if started from different host threads.

Here are some questions:

  1. Am I right in assuming all of the above? Is kernel scheduling completely out of scope, even in CUDA 2.0?
  2. Anything regarding this in the roadmap?
  3. Any undocumented options or hacks to gain such functionality?
  4. If there is always a single kernel running, what’s the use of several (say 16) MPs running exactly the same code? There could just be 1 MP with 128 stream processors doing the same thing…

Yes and yes. My understanding is that this is a hardware design decision and thus future hardware might allow multiple simultaneous kernels.

There was a vague hint about something that might be this in the roadmap they showed at NVISION 08. I don’t remember anything about it now, but here is the post I made shortly after the conference:…st&p=431970

The presentation this was in doesn’t appear to be available online (at least at NVIDIA’s website).

Put each of your separate “kernels” in a switch or if chain that determines which operations to run based on the block index. It isn’t pretty, but it works.

Well, I’m sure there are a lot of engineering design decisions that go into what is in a single MP: heat dissipation, die size, number of transistors, cache sizes, data movement, attachment to the memory subsystem, … Plus, the design of a relatively small MP allows NVIDIA to scale their GPUs’ performance up or down just by adding/subtracting MPs from the chip (look at the GTX 280 with 30 MPs!).

GPUs are optimized for highly data-parallel algorithms, so we just have to live with it. There are certainly cases where it might be nice to independently schedule kernels to MPs, but the graphics side doesn’t seem to need it now. CUDA is what NVIDIA could give us for GPGPU without adding a significant number of transistors to the hardware over the graphics functionality.

I’m still skeptical though: launch overhead is already high, imagine what it will be if you start launching 30x more kernels to keep all the MPs busy!

I think it would also be interesting to see support for multiple simultaneous kernels. However, in my (admittedly limited) experience with CUDA, I’ve found that even algorithms which don’t scale to lots of threads can see good speedups if your program needs to execute those “serial” algorithms on a large number of data elements (data parallelism). Learning to think in this parallel manner (data-parallel, not just instruction-parallel) has been the hardest part of learning CUDA for me.

Thanks for the quick reply.

Sure, I could do that - but only if these operations take roughly the same time to execute. I don’t want 15 MPs to be idle while the remaining one is still executing the last long task. Another issue is launch latency - I’m unable to initiate a new task until this very massive kernel has completely finished.

I think there could also be a “dynamic scheduling” solution: every thread, after finishing a task, looks at a queue of tasks, and this queue is dynamically appended to by the CPU, which manipulates device memory. A rather dirty solution, since there are no synchronization primitives to avoid a race condition on this queue (GPU vs. CPU). It might be a nightmare making this stuff stable. Has someone perhaps already done this work?
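For what it’s worth, a GPU-only variant of this queue avoids the CPU/GPU race: the host fills the queue before launch, and blocks pop work with an atomic. All names below are invented for illustration, and atomicAdd needs compute capability 1.1+; appending from the CPU while the kernel runs would reintroduce exactly the race described above.

```cuda
// Sketch of a device-side task queue, illustration only.
// The host writes taskData and zeroes nextTask before launch; launching
// more blocks than tasks is fine, since surplus blocks exit immediately.
__device__ int nextTask;   // queue head in device memory

__global__ void workerKernel(float *taskData, int numTasks)
{
    __shared__ int myTask;

    for (;;) {
        if (threadIdx.x == 0)
            myTask = atomicAdd(&nextTask, 1);   // pop the next task index
        __syncthreads();

        if (myTask >= numTasks)
            return;                             // queue drained

        // ... all threads of this block cooperate on task myTask,
        // e.g. process taskData[myTask] ...
        __syncthreads();                        // finish before the next pop
    }
}
```

Blocks that grab short tasks simply pop again, so the load balances itself across MPs without any kernel relaunch.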

As to the design, I’m almost sure that different MPs are intended to be able to run different shaders. Still, we are unable to run different kernels through the framework. As to transistors, wouldn’t reducing the number of instruction pipelines reduce the number of transistors? I agree with your “scaling” argument, but it would still be possible to scale up and down by the number of SPs. So I don’t think the limitation is in the hardware.

Anyway, the launch overhead is controlled by the programmer… For example, if a launch takes even 1 ms (actually less, as you know), then any kernel running 100 ms or longer is launch-effective, since the launch takes at most 1% of the time… I don’t think anyone would care about that level of overhead, right? And we would still be able to launch plenty of such 100 ms kernels…

I think that if an algorithm scales badly to lots of threads, it does so whatever you have in mind (either instruction or data parallelism, which are actually the same thing, to my mind).

For example as a part of my work I have a need to merge multiple pairs of sorted arrays (having size about 1000-100000 elements).

  • On the one hand, I don’t want to do them all in 1 kernel (since the amount of work can differ significantly, and many MPs would just be idle most of the time).

  • On the other hand, I can’t do a single merge in 12288 threads. Hmm, actually I can, but for the specified array sizes it is expected to take even more time than in a single block.

  • On the third (LOL) hand, we all understand that merging sorted arrays is a task that can significantly benefit from streaming SIMD (what NVIDIA calls SIMT). Still, I cannot use this efficiently in CUDA. Rather disappointing for me.
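To make the dilemma concrete (not a recommendation), the “one merge per block” layout might look like the sketch below. All array/offset names are invented, and the merge itself is done by a single thread - which is exactly the serial bottleneck being complained about.

```cuda
// Illustration only: block p merges the p-th pair of sorted arrays.
// offA/offB/offOut give each pair's starting position, lenA/lenB its sizes.
__global__ void mergePairs(const float *a, const float *b, float *out,
                           const int *offA, const int *offB, const int *offOut,
                           const int *lenA, const int *lenB)
{
    if (threadIdx.x != 0)
        return;                       // naive: one thread merges the pair

    int p = blockIdx.x;               // one pair of arrays per block
    const float *x = a + offA[p];
    const float *y = b + offB[p];
    float *o = out + offOut[p];
    int i = 0, j = 0, n = lenA[p], m = lenB[p];

    while (i < n && j < m)
        *o++ = (x[i] <= y[j]) ? x[i++] : y[j++];
    while (i < n) *o++ = x[i++];      // drain the remainder of the first array
    while (j < m) *o++ = y[j++];      // drain the remainder of the second array
}
```

A block whose pair is 100000 elements keeps its MP busy long after the 1000-element blocks have finished - the imbalance described above.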

I was just speaking in a very general sense about transistors, nothing specific in mind. Just trying to point out that there are a lot of different elements that go into the hardware design so who are we to judge the engineers?

Also: don’t forget that each MP has an instruction decoder. Your suggestion of one “super” MP with 128 ALUs would make the warp size 512 (keeping all other design decisions the same)! That would be an even more data-parallel design than the current 1-decoder-to-8-ALUs design. Thinking about it now, the GPU design really is just a variable number of SPs; the MP (8 ALUs) is simply the smallest unit that the SP count can be increased/decreased by. It is just semantics. We are only running into difficulty here because, for the purposes of this discussion, you want to see each MP as a separately controlled/independent entity, more like an SMT CPU core.

Umm, the launch overhead is relatively static and not at all controlled by the programmer. Just time how long it takes to run an empty kernel and you’ll get about 20-30 microseconds: that is the overhead I was referring to. What I was thinking when I made the statement about the overhead is this: I typically launch ~1 ms kernels across the whole GPU (30 MPs) with an overhead of 30 microseconds. If I instead had finer control, and it made sense, to break this problem into 30 separate kernels, each on one MP, then the overhead would be 30 microseconds * 30 = 0.9 milliseconds! That overhead is about the same as the total execution time of the original kernel.

And no, I don’t have control over the time spent in the kernel. I can only do so much before a global sync is required and the kernel has to terminate. Sure, larger system sizes can lead to longer run times and less overhead, but I don’t always want to run a larger system.

I agree that at the 100 ms level, the overhead would not be a big deal so you are fine there. I jumped to the assumption I did because most on the forums who have asked for concurrent kernels in the past have also complained that the ~20-30 us launch overhead is too big for small tasks and want concurrent launches so they can run many small tasks at once…
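The ~20-30 us figure is easy to reproduce by averaging many empty launches with CUDA events; a sketch using the CUDA 2.x-era runtime API (note that queued launches can overlap, so this measures launch throughput more than one-shot latency):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int N = 1000;               // launches to average over
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();          // warm-up (the first launch is slower)
    cudaThreadSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();      // asynchronous empty launches
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.1f us\n", 1000.0f * ms / N);
    return 0;
}
```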

I slightly disagree with your argument regarding the warp size of 512.
Even with a warp size of 32, we can always execute fewer threads (e.g. if (threadIdx.x < 8)) by making some threads conditionally inactive.

Let me remind you that we are discussing tasks which cannot be effectively distributed over all MPs. In such a case there is no difference: either we leave some MPs unutilized, or we leave some threads of the 512-warp unutilized.
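The “conditionally inactive” point in code form, a trivial sketch:

```cuda
// Only 8 of the warp's 32 lanes do useful work; the other 24 are masked
// off but still occupy the warp slot - mirroring the idle-MP situation.
__global__ void narrowKernel(float *data)
{
    if (threadIdx.x < 8)
        data[threadIdx.x] *= 2.0f;
}
```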

My point is that in the current design of CUDA, all MPs seem to do exactly the same (and maybe even synchronous?) work. This is still rather strange to me. Being optimistic enough, I really hope that soon we will get, say, version 2.1 with an option for dynamic scheduling :-) Or at least an option to specify kernel affinity (a subset of MPs), thus allowing another kernel to run with the complementary affinity.

MPs are not all doing the same work; each has an instruction decoder and can schedule a unique set of instructions to its SPs. That’s why you don’t lose much speed if you have block-aligned or even warp-aligned branches. In fact, if you really wanted, you could write a kernel which does two or more completely unrelated tasks, like:

__global__ void gameEngineIteration(...) {

    if (blockIdx.x < 100) {
        // do collision detection
    }
    else if (blockIdx.x >= 100 && blockIdx.x < 200) {
        // process AI
    }
    else {
        // ...
    }
}
As long as the branches don’t occur mid-warp, there’s nothing wrong with that (I think).