Unexplained stalls in CUDA API calls?


I’m seeing an obscure problem when running CUDA compute on the Jetson TK1 (GK20A).

The problem manifests itself as random spikes in run-time. I’ve profiled with NVVP, collecting both kernel execution times and CUDA API profiling information.

The data I’ve got suggests nothing is wrong with the kernel execution times; they fluctuate by 0.1–0.2 ms at most. I’ve collected the data over sufficiently long sequences of frames.

I measure the per-frame run-time as below:

  CHECK_CUDA(cudaEventRecord(set_up.startEvent, 0));

  // do processing

  CHECK_CUDA(cudaEventRecord(set_up.stopEvent, 0));


I make use of pinned CPU / GPU shared memory when processing, but the majority of the load is on the GPU. The GPU writes out its results to the shared memory, and then I access them from the CPU.
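The post doesn’t show the allocation code, but pinned, mapped (“zero-copy”) memory on the TK1 is typically set up along these lines (variable names and the buffer size are placeholders of mine):

```cuda
// Hypothetical sketch of the zero-copy setup described above; the poster's
// actual allocation code is not shown. On TK1 the CPU and GPU share physical
// memory, so a pinned, mapped allocation is visible to both sides.
float *host_results = nullptr;
float *dev_results  = nullptr;

CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));        // enable mapped pinned memory
CHECK_CUDA(cudaHostAlloc(&host_results, N * sizeof(float),
                         cudaHostAllocMapped));           // pinned + mapped
CHECK_CUDA(cudaHostGetDevicePointer(&dev_results, host_results, 0));

// Kernels write through dev_results; the CPU reads host_results,
// but only after synchronising (see below).
```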

My observation is that I need to call one of the CUDA API synchronisation functions so that the CPU / GPU shared memory gets synced properly. Otherwise, I see stale or incorrect contents when accessing the memory from the CPU after the GPU has written to it.

At first, I had a simple arrangement where all kernels were executed in the default stream, and just before the CPU was to access the shared memory containing the GPU’s output, I’d call cudaDeviceSynchronize. On rare occasions, cudaDeviceSynchronize would stall for up to 4 ms at random.
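That first arrangement amounts to something like the following (kernel and buffer names are placeholders, not the poster’s actual code):

```cuda
// Sketch of the original single-stream arrangement (names hypothetical).
stage1<<<grid, block>>>(dev_results);   // all kernels in the default stream
stage2<<<grid, block>>>(dev_results);

CHECK_CUDA(cudaDeviceSynchronize());    // this call occasionally stalled ~4 ms

consume_on_cpu(host_results);           // CPU now reads the shared memory
```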

The same would happen for me if I used cudaEventSynchronize.

I then rearranged my processing to make use of streams. Three of the kernels I need to run can be run concurrently. They all need to wait for data output from another kernel first, though. So the current arrangement I have is:

  • one kernel does the first stage of processing in stream 0
  • three kernels get submitted each to its own stream, each with a cudaStreamWaitEvent dependency on stream 0 being done with the first kernel
  • CPU then waits for each of the three kernels with cudaStreamSynchronize and then proceeds to access the shared memory to which the three have written
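The arrangement above can be sketched as follows (stream, event, and kernel names are placeholders of mine; the dependency structure is what matters):

```cuda
// Sketch of the stream/event arrangement described above (names hypothetical).
cudaStream_t s1, s2, s3;
cudaEvent_t  stage1_done;
CHECK_CUDA(cudaStreamCreate(&s1));
CHECK_CUDA(cudaStreamCreate(&s2));
CHECK_CUDA(cudaStreamCreate(&s3));
CHECK_CUDA(cudaEventCreateWithFlags(&stage1_done, cudaEventDisableTiming));

stage1<<<grid, block>>>(dev_in);                 // first stage in stream 0
CHECK_CUDA(cudaEventRecord(stage1_done, 0));

// Each dependent kernel waits on the event, then runs in its own stream.
CHECK_CUDA(cudaStreamWaitEvent(s1, stage1_done, 0));
kernelA<<<grid, block, 0, s1>>>(dev_out_a);
CHECK_CUDA(cudaStreamWaitEvent(s2, stage1_done, 0));
kernelB<<<grid, block, 0, s2>>>(dev_out_b);
CHECK_CUDA(cudaStreamWaitEvent(s3, stage1_done, 0));
kernelC<<<grid, block, 0, s3>>>(dev_out_c);

// CPU waits for all three before touching the shared results.
CHECK_CUDA(cudaStreamSynchronize(s1));
CHECK_CUDA(cudaStreamSynchronize(s2));
CHECK_CUDA(cudaStreamSynchronize(s3));
```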

Strangely, in this arrangement, the stall moved to cudaLaunch. I found on rare occasions, cudaLaunch would stall for up to 11ms!

I’ve now added a call to __threadfence_system() at the end of each of my kernels, and I now create the streams with cudaStreamDefault rather than with the cudaStreamNonBlocking flag. That seems to be helping so far. However, I still don’t know what the underlying problem is.
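For concreteness, the workaround looks roughly like this (the kernel body and names are placeholders, not the poster’s actual code):

```cuda
// Sketch of the workaround described above (names hypothetical).
__global__ void kernelA(float *out)
{
    // ... actual processing ...
    __threadfence_system();   // make writes visible to the host before exit
}

cudaStream_t s1;
// cudaStreamDefault (0) instead of cudaStreamNonBlocking
CHECK_CUDA(cudaStreamCreateWithFlags(&s1, cudaStreamDefault));
```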

The only similar topic on the forums I could find online was https://devtalk.nvidia.com/default/topic/523698/strange-cudalaunch-stall-in-nv-visual-profiler/ but I see the run-time spikes when not profiling too. Plus, the CUDA runtime version I’ve got on the TK1 is 6.5.

Any clues please?

You might get faster / better answers by posting questions about TK1 to the sub-forum dedicated to it:


These embedded platforms are “special” in that (1) GPU and CPU share the same physical memory (2) they only have a single SMX (I think). That may lead to performance artifacts that don’t occur with standalone GPUs.

My question was answered in https://devtalk.nvidia.com/default/topic/1024484/unexplained-stalls-in-cuda-api-calls-reproducer-attached/, message #7. Thanks.