Concurrent kernel and events on Kepler


I am trying to get 14 concurrent kernels to work on a Tesla Kepler card (Tesla K20Xm with 14 SMX units).
If I launch them without recording any events before and after the launch, then I get exactly 14 concurrent kernels.
However, if I launch them with events, I only get 8 concurrent kernels.

Why is the use of events limiting me to 8 concurrent kernels?

Is there more information about:

  • hardware queues (false intra-stream dependencies)
  • implicit synchronization
  • dependency checks
  • delayed signals between sequentially issued kernel launches
  • etc?

I thought most of these problems were solved on Kepler (HyperQ - 32 hardware queues)…

At the moment I only know about:

Somehow I must have missed something…

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_STREAMS 14

__global__ void kernel_wait64(uint64_t delta) {
	uint64_t start = clock64();
	uint64_t stop = start + delta;
	while(clock64() < stop);  // you will need '-arch=sm_20' for clock64().
}

void debug_check(const char *func, int line) {
	cudaError_t code = cudaGetLastError();
	if(code != cudaSuccess) {
		const char *msg = cudaGetErrorString(code);
		printf("Error #%d: %s\n", code, msg);
		printf("Function: %s\n", func);
		printf("Line: %d\n", line);
		exit(1);
	}
}

int main(int argc, char **argv) {
	cudaStream_t stream[NUM_STREAMS];
	cudaEvent_t event_a[NUM_STREAMS];
	cudaEvent_t event_b[NUM_STREAMS];
	for(int i=0; i<NUM_STREAMS; i++) {
		cudaStreamCreate(&stream[i]);
		debug_check("cudaStreamCreate", __LINE__);
		cudaEventCreate(&event_a[i]);
		debug_check("cudaEventCreate", __LINE__);
		cudaEventCreate(&event_b[i]);
		debug_check("cudaEventCreate", __LINE__);
	}
	for(int i=0; i<NUM_STREAMS; i++) {
		cudaEventRecord(event_a[i], stream[i]);
		debug_check("cudaEventRecord", __LINE__);
		// this kernel waits for 700 million clocks,
		// so it waits less than 1 second on a Tesla K20Xm (Kepler).
		// (launch config assumed: one single-thread block per stream)
		kernel_wait64<<<1, 1, 0, stream[i]>>>(700ULL * 1000 * 1000);
		debug_check("cudaLaunch", __LINE__);
		cudaEventRecord(event_b[i], stream[i]);
		debug_check("cudaEventRecord", __LINE__);
	}
	cudaDeviceSynchronize();
	debug_check("cudaDeviceSynchronize", __LINE__);
	return 0;
}

nvcc: (you will need ‘-arch=sm_20’ for the clock64 function)

nvcc -O2 -arch=sm_20 -o events.out

Any help would be really appreciated.

The only thing I can add is that the GTX Titan (GK110) and GK208 (SM 3.5 mobile chipsets) by default limit you to 8 concurrent kernels. Not sure why events would change that on the K20/K20m, but just a heads up that it is a limitation on consumer-level SM 3.5 cards.

One thing I can think of… does this happen on more than one platform? i.e. Windows vs Linux?
Does the same thing happen if you add events to the CUDA SDK example?

Maybe there is a difference if you call cudaEventSynchronize? See:

@ Griffith

You might want to try events without timing, i.e. create them with cudaEventCreateWithFlags and pass the cudaEventDisableTiming flag.
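A sketch of that call (hypothetical variable name; error checking omitted):

```cuda
cudaEvent_t ev;
// An event created with cudaEventDisableTiming carries no timestamp,
// so it can be used for synchronization without the extra overhead
// (and potential serialization) that timed events introduce.
cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
```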
For reasons known only to NVIDIA, the pitfalls that prevent concurrent kernel execution are not very well documented. E.g. it would be nice to have a guarantee that all functions in the CUDA libraries (most importantly cuBLAS) can run concurrently in different compute streams, i.e. that they do not set or change any limit or make a malloc call.


The CUDA driver limits the number of simultaneous connections to a GK110 device to 8 by default. The environment variable CUDA_DEVICE_MAX_CONNECTIONS can be used to increase the number of connections to 32. The default is limited to 8 as each connection requires additional memory and resources and the majority of CUDA applications use no streams or only 2-3 streams to achieve concurrency between memory copies and kernels.

Each stream is mapped to a connection. If the number of streams exceeds the number of connections then multiple streams will share connections. The heuristic for this sharing is not documented and may change between driver releases.

In CUDA, commands submitted to a stream are guaranteed to complete in order. If the application submits a grid launch and an event record to a stream then the driver will push the grid launch, a synchronization command, and the event record to a connection. The front end will not process the event record command until the kernel launch completes and clears the synchronization token. The connection is blocked. On compute capability 3.5 devices the front end can continue to process other connections. On compute capability < 3.5 devices the front end is simply blocked.

Your application uses 14 streams, which is greater than the number of connections. The event records on the 9th-14th streams are being blocked by previously issued work.

To fix this problem you can increase CUDA_DEVICE_MAX_CONNECTIONS to a value > 14.
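For example, from the shell (the maximum of 32 is taken from the description above; the variable must be set before the application creates its CUDA context):

```shell
# Raise the number of hardware connections before starting the app.
export CUDA_DEVICE_MAX_CONNECTIONS=32
echo "CUDA_DEVICE_MAX_CONNECTIONS=$CUDA_DEVICE_MAX_CONNECTIONS"
# prints CUDA_DEVICE_MAX_CONNECTIONS=32
```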

There are a few other forms of false serialization that can occur when you change context state. For example changing the size of the local memory heap will block execution of work. Changing the L1 configuration or shared memory bank configuration can lead to two kernels failing to execute concurrently on the same SM resources.
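The context-state changes named here correspond to standard runtime calls; a sketch of the three (each of these can serialize otherwise concurrent work when issued between kernels):

```cuda
// Resizes the local memory heap (per-thread stack), blocking execution:
cudaDeviceSetLimit(cudaLimitStackSize, 2048);
// Changes the L1/shared split for subsequent kernels:
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
// Changes the shared memory bank configuration:
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
```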

Ah. Thanks a lot Greg.
This solves my problem :)

Now, I simply added this line to the beginning of my program and it works:

Thanks, I totally forgot about this whitepaper.
However, even this example didn’t give the right results:

I am using Linux only and needed to run the example explicitly with “CUDA_DEVICE_MAX_CONNECTIONS=32 ./simpleHyperQ” to get the same results as presented in the whitepaper. It did not work out of the box:

This example measures the overall execution time of all kernel calls.
I want to measure the execution time of each individual kernel call.
(btw: this example uses the deprecated cutil.h)


Yes. If I create the events with the “cudaEventDisableTiming” flag, then I get 14 concurrent kernels.
However the only reason I created the events was to get the execution time.
But maybe I should use CUPTI instead of CUDA events.
I don’t know which of these two methods is more accurate / involves less overhead.
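For reference, the event-based timing pattern being traded off here (a sketch reusing the array names from the code above; cudaEventElapsedTime only works for events created without cudaEventDisableTiming):

```cuda
// After work in stream[i] has been enclosed between event_a[i] and event_b[i]:
cudaEventSynchronize(event_b[i]);   // wait until the second event has completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, event_a[i], event_b[i]);
printf("stream %d: %.3f ms\n", i, ms);
```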


Thanks for clarifying this. I remember executing simpleHyperQ with a K20 in the past and it worked without setting that variable manually. Must be something that crept in with a newer driver version.


What about the default and maximum number of simultaneous connections on Fermi cards (T20A)?
Could I also set it via CUDA_DEVICE_MAX_CONNECTIONS?

In other words, on 2.0 devices ALL streams (connections) are blocked after a cudaEventRecord until the event record has actually finished? Am I right?

Devices of compute capability < 3.5 only have a single connection to the device. CUDA events will stall execution. This is why the order in which commands are submitted from the API is so critical.

The various NVIDIA CUDA profilers do not use CUDA events to time the execution of kernels as CUDA events have many issues related to concurrency. The methods used by the developers tools do not have these limitations and provide more accurate timing information. The duration provided by the profilers does not include kernel setup time. The duration is bound as close as possible to the execution of the kernel code.

Hi Greg,

“The various NVIDIA CUDA profilers do not use CUDA events to time the execution of kernels as CUDA events have many issues related to concurrency”

Could you please elaborate? How should one synchronize in real-world apps besides using events? Let’s say I have multiple threads submitting work against a K20c; will events somehow kill concurrency?

Also, are you aware of any performance degradation if the CUDA runtime is accessed concurrently from a few threads, all of them directed to ONE GPU? (I’m not asking about each CPU thread accessing a different GPU, but rather about multiple CPU threads accessing the same K20c.)


CUDA events are the correct method to synchronize streams. The developer has to be very careful on compute capability 3.0 and earlier devices as the events can serialize execution due to the order the work is submitted to the GPU.
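The cross-stream synchronization pattern referred to looks like this (a sketch with hypothetical stream, event, and kernel names):

```cuda
// Record an event in streamA, then make streamB wait on it. The wait is
// asynchronous to the host: only work queued in streamB is held back.
cudaEventRecord(ev, streamA);
cudaStreamWaitEvent(streamB, ev, 0);       // flags must be 0
kernelB<<<grid, block, 0, streamB>>>();    // runs only after ev completes
```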

If you want to get timing information then I recommend you use the developer tools as they avoid these false dependencies (sometimes at a small kernel runtime overhead).

If you are seeing a performance degradation when you have multiple CPU threads submitting work to one device then I recommend you submit a bug through the CUDA registered developer program. Given the number of blocking API calls in CUDA and the number of false dependencies in devices with compute capability < 3.5 I can definitely see places where different API call order could lead to reduced performance.

Hi Greg,
Thanks for the info. I’ll try to submit a bug.

Another follow-up question, please. I submit a lot of CUDA calls to the runtime (memcpys and kernel launches) using multiple streams and threads. Nsight seems to dump its trace information from time to time.
With so many calls to the runtime, Nsight’s output is mostly dominated by this buffer flushing, which also seems to sync all threads/streams.
Is there a way to avoid that, or to tell Nsight to use bigger buffers so that flushing doesn’t happen so often?

thanks again

I wanted to post my own experience with the tidbit below. Basically, the GTX Titan can actually run 32 streams at the same time, and the GK208 can run 16 streams at the same time on Linux (as opposed to 8, as I previously thought).

With my current Windows 7 system, do I need to create/set that environment variable as well (or the Windows equivalent)?

When I run the Hyper-Q example from the SDK, it seems to be using 8 streams. Just want to make sure I am interpreting that correctly.

Using the TCC driver with a Tesla K20c and the Visual Studio 2010 IDE.

I was able to set that variable to 32 (it was initially set to 8). I verified it using echo %CUDA_DEVICE_MAX_CONNECTIONS% and then ran the CUDA SDK example simpleHyperQ:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Debug>nvprof --print-gpu-trace simpleHyperQ.exe
======== NVPROF is profiling simpleHyperQ.exe...
======== Command: simpleHyperQ.exe
starting hyperQ...
GPU Device 0: "Tesla K20c" with compute capability 3.5

> Detected Compute SM 3.5 hardware with 13 multi-processors
Expected time for serial execution of 32 sets of kernels = 0.640s
Expected time for fully concurrent execution of 32 sets of kernels = 0.020s
Measured time for sample = 0.021s

When it was set to the default of 8 the running time was 0.05 s; after the change it is 0.021 s, so it worked!

An odd side effect of setting that environment variable has been that CUDA seems to have reset the Tesla K20c as device 0 and ‘lost’ my GTX 680. Before, the GTX 680 was device 0 and the K20c device 1, and when I ran deviceQuery or CUDA-Z that is how they would appear.

Now both CUDA-z and the Device query only see the Tesla K20c, and then hang until I manually break.

The funny thing is that I can still use the GTX 680 for games with no issues, and my code, which runs on the K20, still works fine.

Anybody have any insight to this issue? If I set that variable back to 8 will it see the GTX 680 again?

Yep, that explains it (probably not enough memory for the 2 GB GTX 680, even though I cannot use Hyper-Q on it). When I set that variable back to 8 the 680 reappeared unharmed. Resolved, but it is good to know how to get those 32 concurrent kernels via Hyper-Q, which works as advertised.

Still hope someone will let me know if Hyper-Q is enabled on the GTX 780 (compute 3.5), and the max number of concurrent kernels.

Could you elaborate a bit on this statement?

I can get the Hyper-Q example from the SDK to run 32 kernels concurrently in 21 ms, but when I try to launch a stream of 4 rather large kernels it ends up serializing them. When I say large I mean the problem space being explored in the kernel (each kernel’s launch parameters are <<<(3578644,1,1),(256,1,1)>>>).