Max 1 or 2 concurrent kernels per SM?

I’ve seen indications that if 2 kernels can fit into the resources of a Kepler SM, they may be able to run concurrently.

Can they? If so, any examples on how to set this up?

Take a look at the Concurrent Kernels CUDA Sample.

There was also a long discussion on this topic here.

I see that sample executes only 8 concurrent kernels, which is less than or equal to 1 per SM for most Kepler cards.

I already know that is possible.

Already saw that, but thanks. It seems that is about 1 concurrent kernel per SM.

I want to know if more than one per SM is possible.

Yes. A GK208 can launch 16 concurrent kernels but it only has 2 SMX’s.

Note that the Jetson TK1 does not appear to support concurrent kernels. Maybe it’s a driver or CUDA bug.

Firstly, it depends on the device's maximum number of concurrent kernels per SM, as per its compute capability (Kepler == 3.5, and for 3.5 it is 2, if I am not mistaken)

Secondly, it depends on the resource requirements of the kernels next in line to be scheduled from the resident streams: their shared memory usage, their thread block sizes, and, I believe, also their number of thread blocks

Thirdly, it depends on the number of streams you have created, since work - kernels, memory transfers, etc. - in the same stream executes sequentially
You need multiple streams to execute different kernels concurrently, and you need more streams than SMs to start seeing multiple kernels per SM
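To make the last point concrete, here is a minimal sketch (the kernel name `myKernel` and the buffer layout are placeholders, not from any sample): one single-block kernel is issued into each of several non-default streams, which is the precondition for the hardware to overlap them.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel - substitute your own per-block work.
__global__ void myKernel(float *data) { /* ... */ }

int main() {
    const int nStreams = 28;           // e.g. 2 kernels/SM x 14 SMs on a K20
    cudaStream_t streams[nStreams];
    float *buf;
    cudaMalloc(&buf, nStreams * 1024 * sizeof(float));

    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Launches in *different* streams may overlap; launches issued into the
    // same stream are always serialized.
    for (int i = 0; i < nStreams; ++i)
        myKernel<<<1, 1020, 0, streams[i]>>>(buf + i * 1024);

    cudaDeviceSynchronize();
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(buf);
    return 0;
}
```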

A very practical way to observe this is to switch to the debugger perspective, click the CUDA tab, and note which kernel thread blocks run on which SMs (pause your application via a breakpoint just after all kernels have been launched in different streams)

Kepler can be 3.0 or 3.5

And where is the “max number of kernels per SM specification” documented?

Regarding your 1st point: correct, I apologize

Regarding your 2nd point:

Section Concurrent Kernel Execution, p32 of the programming guide…?
If you fail to agree that this stipulates a max number of kernel per SM, then I would retract my point
Nevertheless, there is a clear max number of resident blocks per SM stipulation, and you must agree that this implies an upper limit on kernels per SM

Thank you for your advice.

My kernels are 1 thread block, 1020 threads, capability 3.5, and if I set max reg to 32, the occupancy calculator says 2 will fit in 1 SM.

How many streams do I need to enable the card to process 2 x 14 kernels concurrently on a Titan/K20?

Would that be 28 streams? And the kernels should be distributed evenly between them?

In my experience, the GPU distributes the workload evenly between SMs - this is something you can observe in the debugger perspective when you use breakpoints; pay attention to how the GPU “seats” kernels and their blocks - the actual SMs that thread blocks are eventually assigned to
So, instead of drawing from streams and merely packing a single SM to full capacity before moving on to the next SM, the GPU seemingly iterates between SMs, beginning to assign pieces of work, and subsequently returning to previous SMs if it still has work left that can be assigned
Again, these are merely my observations; monitor this in the debugger perspective, and draw your own conclusions

You are right
If you wish to run 28 kernels concurrently, you need 28 streams
Anything less would mean that you have either not issued all the kernels, or have issued multiple kernels in one or more streams, in which case the kernels in those streams execute sequentially; either way, the 28 kernels would not all run concurrently

The key is to take ownership of the kernels you would like to run concurrently
Plan; launch and monitor in the debugger perspective; and revise if it is clear that the intended kernels do not run concurrently
You do not have to guess whether your kernels run concurrently; the debugger tools are sufficient to actually monitor it (on linux at least)


Kernels do run concurrently if there are enough resources. On Kepler devices (cc 3.x) one SM can have 2048 resident threads. Since a block has at most 1024 threads, it follows that for maximum occupancy one can have more than 1 block per SM - maximum 16 blocks per SM. This is done automatically, but there are some requirements: there is a limited number of registers and a limited amount of shared memory. For the registers one can force spilling to local memory (via launch bounds). In practice higher occupancy might not give the best performance, because it might mean lots of accesses to global memory, but for one application I wrote I found that higher occupancy was faster for larger systems, even with the spills.

You can see the amount of shared memory and registers used by compiling with -Xptxas -v.

If the blocks use too much shared memory and there is not enough for more than 1 block per SM, then you get extra loads from global memory.
Please note that 2048 active threads does not mean that 2048 threads are running at the same time. In practice only 2 or 3 warps are executing; the rest are waiting for loads from global memory and/or waiting for cores to become free.


All right, I’ve created a kernel, 2 of which should fit in an SM according to the occupancy calculator.

1020 threads, 32 registers (set by max), 9488 bytes shared memory.

I need to launch 5214 of these kernels at a time with as much concurrency as possible.

I created 28 streams, and attempted to launch each kernel, cycling through the streams.

At count of 568, I get cudaErrorInvalidValue (11) on kernel launch. I presume this means I’ve hit a queuing limit of 23 kernels per stream, but I am unable to find info on this spec.

Anyone encountered this, or know of an example on how to manually deal with queuing large numbers of kernels?
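One straightforward way to distribute the launches, sketched under the assumptions above (`myKernel`, the `streams` array, the device buffer `buf`, and the per-kernel stride `CHUNK` are all placeholders for your own code), is to cycle through the streams and check for launch errors as you go:

```cuda
const int nKernels = 5214, nStreams = 28;
for (int k = 0; k < nKernels; ++k) {
    // Round-robin over the streams; kernels in different streams may overlap.
    myKernel<<<1, 1020, 0, streams[k % nStreams]>>>(buf + (size_t)k * CHUNK);

    cudaError_t err = cudaGetLastError();   // catches invalid launch parameters
    if (err != cudaSuccess) {
        fprintf(stderr, "launch %d failed: %s\n", k, cudaGetErrorString(err));
        break;
    }
}
cudaDeviceSynchronize();
```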

You could use callbacks - issue callbacks within streams
So, launch x kernels in a particular stream, and then a callback in that very same stream; do this for all streams
The callback function executes the moment all work in a particular stream is finished
So you could essentially have your callback function trigger a reload of that stream
Reference the programming guide for more information on callbacks
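A sketch of that pattern with `cudaStreamAddCallback` (CUDA 5.0+); `markStreamReady` is a hypothetical host-side function, since CUDA API calls are not permitted inside the callback itself:

```cuda
// Fires on the host once all work previously queued in `stream` has finished.
void CUDART_CB onBatchDone(cudaStream_t stream, cudaError_t status,
                           void *userData) {
    int streamId = *(int *)userData;
    // CUDA API calls are not allowed inside a callback, so just signal a
    // host thread that this stream can be refilled.
    markStreamReady(streamId);   // hypothetical flag-setting function
}

// After launching a batch of kernels into streams[i]:
cudaStreamAddCallback(streams[i], onBatchDone, &streamIds[i], 0);
```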

Alternatively, you could have a thread per stream on the host that tests the status of the stream it monitors at a regular interval, and have it re-issue work
Again, reference the programming guide for stream APIs available
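The polling variant could look roughly like this; `pendingWork` and `enqueueNextKernel` are hypothetical helpers standing in for your own bookkeeping:

```cuda
// Host-side loop: refill any stream that has drained its queued work.
while (pendingWork()) {
    for (int i = 0; i < nStreams; ++i) {
        // cudaStreamQuery returns cudaSuccess when the stream is idle,
        // cudaErrorNotReady while work is still in flight.
        if (pendingWork() && cudaStreamQuery(streams[i]) == cudaSuccess)
            enqueueNextKernel(streams[i]);   // hypothetical refill
    }
    // optionally sleep/yield here to avoid a hot spin
}
cudaDeviceSynchronize();
```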

At the same time, I would really question your assumption that you actually have to launch 5214 kernels, instead of just 28
Why can’t 28 do (all) the work of 5214?
You say you need concurrency - what does that essentially mean?

Good suggestions all. Thanks.

I need concurrency to keep the GPU as fully loaded as possible to minimize the aggregate processing time.

I suppose that a single kernel could do the work of N = 5214/28 by looping through the algorithm N times; however, I believe that would require more registers, and I’m already spilling due to max reg = 32.

I will have to further evaluate this to determine the best approach. I am also looking at stream events to manage the queues.

Would it not take a single variable to create such a loop, rather than an array of variables (per thread)?
So would it really be that expensive?
All threads can merely test a single variable in shared memory
Or else, use a single variable per warp; still not that expensive memory-wise
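As a sketch of that idea (`doWorkItem` and the stride `CHUNK` are hypothetical placeholders), the loop counter can live in shared memory so that it costs no per-thread register:

```cuda
__global__ void loopKernel(float *buf, int nIters) {
    __shared__ int i;                         // one counter for the whole block
    if (threadIdx.x == 0) i = 0;
    __syncthreads();

    while (i < nIters) {
        doWorkItem(buf + (size_t)i * CHUNK);  // hypothetical per-item work
        __syncthreads();                      // finish the item...
        if (threadIdx.x == 0) ++i;            // ...then advance the counter
        __syncthreads();
    }
}
```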

Actually it is significantly more complicated than just extra registers.

There is also an additional thread contraction for a couple of more complex functions which contain __syncthreads(), a re-expansion to the original thread count, and the effort to get it all working again.

Not that it cannot be done, it just may be harder than managing the queues.

If I knew there was a performance advantage, it might be worth the effort.

How long does it take to launch a kernel? Or rather, how long does it take for an SM to load a kernel from a stream and commence processing it?
I do not know
I have heard mention of ‘long’
This should likely be the key deciding factor

Agreed. My previous experience is that kernel launch times can be overwhelmed by global memory conflicts.

I figured out that the error I was getting was actually due to invalid input data, not stream queue related.

How is launching the same kernel 5214 times in parallel different from launching the kernel just one time with a grid 5214 times as large?

One difference is where the indexing into memory occurs, since all my buffers are on the GPU. If they weren’t, I guess with streams you could interleave buffer copies.

Another difference is I am computing a small block of random numbers on the CPU for each kernel interleaved with kernel launches.
My experience with cuRand doing this on the GPU is fraught with problems and overhead.
I’ve had kernels which would not launch, until I took calls to cuRand out, due to the large amount of resources it consumed. However, this is old data from several years ago.

OTOH, a 5214 block kernel avoids the stream creation overhead.

I may test both and find out if either gives a performance advantage, but I have severe doubts about including cuRand in the kernel.
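For comparison, the single-launch alternative being discussed would look roughly like this (again with `CHUNK` as a hypothetical per-task stride): each block picks its own slice of the device buffer via blockIdx.x, and no streams are needed:

```cuda
__global__ void bigKernel(float *buf) {
    // Each block handles one of the 5214 independent tasks.
    float *myChunk = buf + (size_t)blockIdx.x * CHUNK;
    // ... same per-task work as the single-block kernel ...
}

// One launch in the default stream replaces 5214 separate launches.
bigKernel<<<5214, 1020>>>(buf);
cudaDeviceSynchronize();
```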

The discussion over the preceding posts was equally about launching 28 kernels, each with a grid of 1 block, such that they are guaranteed to run concurrently, and having them do the work of 5214 kernels or, equivalently, 5214 kernel blocks
Hence, it was not a mere comparison of 5214 kernels (each 1 block) versus (1 kernel with) 5214 kernel blocks