Concurrently kernels running on one device

When can CUDA support multiple kernels running on one device concurrently? Does this have been took into consideration by Nvidia? Plus, if more kernels could run concurrently, will some other sychronization mechanism like mutex and semphore be provided? Thanks.

it’s supported on Fermi.

Right now, concurrent kernel execution isn’t supported at all. The next generation Fermi architecture is supposed to support multiple kernel execution, but there are no inter-kernel communication mechanisms that I am aware of.

Thanks a lot for the reply :">

Inter-kernel communication is not supported under pain of me glaring at you really hard.

Which reminds me: The programming guide only says that up to 4 kernels can run simultaneously on the compute 2.0 devices. There does not seem to be any control as to how chip resources are divided between the kernels. Is there some kind of scheduler in the driver that decides how to distribute queued kernels over the multiprocessors? Is it round-robin, load-balancing, something else? I assume we are guaranteed that a stream can’t starve, but probably nothing else.

Another question is the cache and how it’s handled?

Is it only valid within a kernel or also across serveral calls of the same kernel? This is very important if you need global synchronization … then you have to call your kernel multiple times and it would be very bad if the cache would be invalidated for every new kernel?

Are there information about this?

In the programming guide it has a section about using streams, and providing the stream to the kernel execution as the fourth parameter:

<<<blockx, blocky, 0, streamID>>>

Apparently under the basis that kernels from different streams could execute concurrently. Is this just placeholder code for the Fermi architecture then? I implemented it & it made a really insignificant performance gain, so I’m guessing it is serializing in the background.

This relates in my mind to the matrix multiplication example in the guide, is the example merely a demonstration of how to use shared memory with no reason to use it in the real-world? In the real world I would want to multiply x thousand/million matrices at the same time, in which case shared memory is useless, there’s not enough of it. And if kernels cannot execute concurrently serializing the CUDA multiplications is no better than doing all on the CPU? This is continually the problem I’m having, I cannot find any scenario to use shared memory, because the block sizes of my kernels are too large. Lets say I’ve got a bayer conversion on an image, image is 1024x768 pixels = 786432 kernel “iterations”. That leaves less than 1 byte per pixel of shared memory… ?

So I start to think maybe I should have lots of smaller serialized kernels like the matrix multiplication, but then I would have to construct a framework to subdivide the image into serialized sections, small enough to utilize shared memory, which just doesn’t seem right on a processor that can support so many parallel operations. Plus the CUDA examples (denoise) seem to do things like I have.


Streams has been around for a long time, and the main purpose is to enable asynchronous copy operations and kernel/copy overlap (there is a simpleStream example in the SDK which illustrates how it works). NVIDIA have adapted the existing streams API to permit concurrent kernel execution with Fermi, but that wasn’t the original intention (or maybe it was, but the hardware just lagged 5 years behind the API design…)

You can’t launch a block with 1024x768 threads anyway.

Parallelism in CUDA (and OpenCL for that matter) should be considered at two levels - blocks and threads within a block. You can’t write a really well performing app that runs only a single block. One block will map to a single multiprocessor in hardware and a modern GPU has anywhere between 16 to 30 of them. You want hundreds if not thousands of blocks (more than # of MPs for latency hiding and scaling over future architectures).

So in your case you could partition the image into, say, 16x16 patches (256 threads within a block, lots of per-block shared memory).

With Fermi, there will be a third optional level of parallelism that you’ll be able to consider - kernel concurrency. This is orthogonal to what I’ve written before, you still won’t get to write an effective single block kernel for processing 1024x768 images.

Yep I subdivide 16x16 blocks, width/16 and height/16 = number of threads. That doesn’t really affect my point about shared memory, because it’s still shared over the entire block/thread subdivision right? Or are you saying that the threads are serialized within the block, and I can guarantee a thread runs from start -> end without swapping to another thread in the same block mid-execution?

I realize this is a rather naive/basic question, gotta start somewhere though ;)



Shared memory is counted per block, ie. you have 16KB per block.

When you declare

__shared__ float array[32];

array is allocated once per block (not per thread as the usual code semantics would suggest). Each thread within a given block will see the same array. It is your responsibility to handle any race conditions you might encounter due to shared memory.

Threads from a given block run in parallel on a multiprocessor in groups of 32 called warps. Warps can get timesliced (ex. when they stall waiting for a memory fetch). You should not assume any particular order in which blocks (and warps within blocks) are scheduled for execution.

Excellent explanation, thanks enormously! I’ll have a re-think of the problem & try again.


seibert, could you point me to the page in programming guide about the concurrent kernels support for compute 2.0 device?

I cannot find it in the document myself. And I remember in Fermi white paper, it is said to support up to 16 concurrent kernels.

It is on page 35 of the February 2010 version:

:sweat: :sweat: can you do something to change this state? I really busted my ass off trying to code a producer-consumer program on CUDA. But it seems to be mission impossible.

It would require some kind of efficient signaling mechanism between kernels, which definitely is outside the current CUDA programming style. (If you start thinking about how you would define such a mechanism in a SIMT model so that it would both make sense and also not always have terrible performance, you discover that even defining what you want is non-trivial.)

That said, producer-consumer is a convenient way to achieve coarse-grained parallelism. It is just hard to map to a data-parallel architecture, where you usually collapse the producer-consumer chain into one task, but then apply it to multiple inputs at once.

Well… thanks a lot for the reply :rolleyes: