Unexplained stalls in CUDA API calls?

micjan · September 22, 2017, 3:12pm

Hi,

I’m seeing an obscure problem when running CUDA compute on the Jetson TK1 (GK20A).

The problem manifests itself as random spikes in run-time. I’ve profiled with NVVP, collecting both kernel execution times and CUDA API profiling information.

The data I’ve got suggests nothing is wrong with the kernel execution times; they fluctuate by 0.1-0.2 ms at most. I’ve collected the data over sufficiently long sequences of frames.

I measure the per-frame run-time as below:

CHECK_CUDA(cudaEventRecord(set_up.startEvent, 0));

// do processing

CHECK_CUDA(cudaEventRecord(set_up.stopEvent, 0));
CHECK_CUDA(cudaEventSynchronize(set_up.stopEvent));

CHECK_CUDA(cudaEventElapsedTime(&ms,
                                set_up.startEvent,
                                set_up.stopEvent));

I make use of pinned CPU / GPU shared memory when processing, but the majority of the load is on the GPU. The GPU writes out its results to the shared memory, and then I access them from the CPU.

My observation is I need to call one of the CUDA API synchronisation functions so that the CPU / GPU shared memory gets synced properly. Otherwise, I see incorrect contents when accessing the memory from the CPU after the GPU has written out to it.

At first, I had a simple arrangement where all kernels were executed in the default stream, and just before the CPU was to access the shared memory with the results output from the GPU, I’d call cudaDeviceSynchronize. I found that, on rare occasions, cudaDeviceSynchronize would randomly stall for up to 4ms.
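
For readers following along, here is a minimal sketch of that simple arrangement. It assumes the "pinned CPU / GPU shared memory" is mapped (zero-copy) host memory obtained with cudaHostAlloc and cudaHostAllocMapped, which the post does not state explicitly. The kernel `fill`, the buffer names and the sizes are placeholders, and CHECK_CUDA is a guess at what the macro used above looks like.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed shape of the CHECK_CUDA macro: abort on any runtime API error.
#define CHECK_CUDA(call)                                         \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                         \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// Hypothetical stand-in for the real processing kernel.
__global__ void fill(int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2 * i;
}

int main() {
  const int n = 1024;

  // Enable mapped pinned allocations; call this before any other CUDA work.
  CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));

  int *hostBuf = NULL;  // CPU view of the shared buffer
  int *devBuf = NULL;   // GPU view of the same memory
  CHECK_CUDA(cudaHostAlloc((void **)&hostBuf, n * sizeof(int),
                           cudaHostAllocMapped));
  CHECK_CUDA(cudaHostGetDevicePointer((void **)&devBuf, hostBuf, 0));

  // Launch in the default stream; the launch returns before the kernel ends.
  fill<<<(n + 255) / 256, 256>>>(devBuf, n);
  CHECK_CUDA(cudaGetLastError());

  // Wait for the GPU to finish before the CPU reads the results.
  CHECK_CUDA(cudaDeviceSynchronize());

  printf("hostBuf[10] = %d\n", hostBuf[10]);

  CHECK_CUDA(cudaFreeHost(hostBuf));
  return 0;
}
```

Kernel launches are asynchronous with respect to the host, so reading hostBuf before the synchronisation call can return data the kernel has not written yet. That would be consistent with the incorrect contents described above.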

The same would happen for me if I used cudaEventSynchronize.

I then rearranged my processing to make use of streams. Three of the kernels I need to run can be run concurrently. They all need to wait for data output from another kernel first, though. So the current arrangement I have is:

1. One kernel does the first stage of processing in stream 0.
2. Three kernels are each submitted to their own stream, each with a cudaStreamWaitEvent dependency on stream 0 being done with the first kernel.
3. The CPU then waits for each of the three kernels with cudaStreamSynchronize and proceeds to access the shared memory to which the three have written.
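
The three-step arrangement above could be sketched roughly as follows. This is an illustration under assumptions, not the actual code: the kernels `firstStage` and `secondStage`, the buffer names and the sizes are made up, the output buffers are assumed to be mapped pinned memory, and CHECK_CUDA is a guess at the macro used earlier. The streams are created with cudaStreamNonBlocking, so the event is the only thing ordering them after stream 0.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed shape of the CHECK_CUDA macro: abort on any runtime API error.
#define CHECK_CUDA(call)                                         \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                         \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// Hypothetical first-stage kernel: produces the intermediate data.
__global__ void firstStage(float *tmp, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tmp[i] = (float)i;
}

// Hypothetical second-stage kernel: each instance writes its own output.
__global__ void secondStage(const float *tmp, float *out, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = tmp[i] * scale;
}

int main() {
  const int n = 1 << 16, threads = 256;
  const int blocks = (n + threads - 1) / threads;

  CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));

  float *tmp = NULL;  // intermediate buffer, device memory only
  CHECK_CUDA(cudaMalloc((void **)&tmp, n * sizeof(float)));

  cudaEvent_t firstDone;
  CHECK_CUDA(cudaEventCreateWithFlags(&firstDone, cudaEventDisableTiming));

  cudaStream_t streams[3];
  float *hostOut[3], *devOut[3];
  for (int k = 0; k < 3; ++k) {
    CHECK_CUDA(cudaStreamCreateWithFlags(&streams[k], cudaStreamNonBlocking));
    CHECK_CUDA(cudaHostAlloc((void **)&hostOut[k], n * sizeof(float),
                             cudaHostAllocMapped));
    CHECK_CUDA(cudaHostGetDevicePointer((void **)&devOut[k], hostOut[k], 0));
  }

  // Step 1: first stage in stream 0, then an event marking its completion.
  firstStage<<<blocks, threads, 0, 0>>>(tmp, n);
  CHECK_CUDA(cudaGetLastError());
  CHECK_CUDA(cudaEventRecord(firstDone, 0));

  // Step 2: each stream waits on the event, then runs its own kernel.
  for (int k = 0; k < 3; ++k) {
    CHECK_CUDA(cudaStreamWaitEvent(streams[k], firstDone, 0));
    secondStage<<<blocks, threads, 0, streams[k]>>>(tmp, devOut[k],
                                                    k + 1.0f, n);
    CHECK_CUDA(cudaGetLastError());
  }

  // Step 3: the CPU waits for each stream before reading that stream's output.
  for (int k = 0; k < 3; ++k) {
    CHECK_CUDA(cudaStreamSynchronize(streams[k]));
    printf("stream %d: out[2] = %.1f\n", k, hostOut[k][2]);
  }

  for (int k = 0; k < 3; ++k) {
    CHECK_CUDA(cudaFreeHost(hostOut[k]));
    CHECK_CUDA(cudaStreamDestroy(streams[k]));
  }
  CHECK_CUDA(cudaEventDestroy(firstDone));
  CHECK_CUDA(cudaFree(tmp));
  return 0;
}
```

Whether the three second-stage kernels actually overlap depends on the resources each one needs; the sketch only expresses the dependencies.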

Strangely, in this arrangement, the stall moved to cudaLaunch. I found on rare occasions, cudaLaunch would stall for up to 11ms!

I’ve now added calls to __threadfence_system() at the end of all my kernels, and I now create the streams with the cudaStreamDefault flag rather than cudaStreamNonBlocking. That seems to be helping so far. However, I still don’t know what the problem is.
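
As a rough illustration of that workaround, the sketch below ends a kernel with __threadfence_system() and launches it in a stream created with the cudaStreamDefault flag. The kernel `produce`, the buffer names and the CHECK_CUDA definition are assumptions, as is the use of mapped pinned memory. It shows only what the two changes look like in code, not why they would affect the stalls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed shape of the CHECK_CUDA macro: abort on any runtime API error.
#define CHECK_CUDA(call)                                         \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                         \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// Hypothetical kernel that writes results to host-visible memory.
__global__ void produce(int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i + 1;
  // System-wide memory fence: orders this thread's earlier writes before
  // its later ones as seen by the device, the host and peer devices.
  __threadfence_system();
}

int main() {
  const int n = 1024;

  CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));

  int *hostBuf = NULL, *devBuf = NULL;
  CHECK_CUDA(cudaHostAlloc((void **)&hostBuf, n * sizeof(int),
                           cudaHostAllocMapped));
  CHECK_CUDA(cudaHostGetDevicePointer((void **)&devBuf, hostBuf, 0));

  // cudaStreamDefault creates a blocking stream, which synchronises
  // implicitly with work in stream 0 (unlike cudaStreamNonBlocking).
  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamDefault));

  produce<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
  CHECK_CUDA(cudaGetLastError());

  // The fence does not wait for anything; the host still has to
  // synchronise before reading.
  CHECK_CUDA(cudaStreamSynchronize(stream));

  printf("hostBuf[0] = %d\n", hostBuf[0]);

  CHECK_CUDA(cudaStreamDestroy(stream));
  CHECK_CUDA(cudaFreeHost(hostBuf));
  return 0;
}
```

Note that __threadfence_system() is an ordering guarantee, not a completion signal, so it does not replace cudaStreamSynchronize or cudaDeviceSynchronize on the host side.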

The only similar topic on the forums I could find online was https://devtalk.nvidia.com/default/topic/523698/strange-cudalaunch-stall-in-nv-visual-profiler/ but I see the run-time spikes when not profiling too. Plus, the CUDA runtime version I’ve got on the TK1 is 6.5.

Any clues please?

njuffa · September 22, 2017, 4:39pm

You might get faster / better answers by posting questions about TK1 to the sub-forum dedicated to it:

https://devtalk.nvidia.com/default/board/162/jetson-tk1/

These embedded platforms are “special” in that (1) the GPU and CPU share the same physical memory, and (2) they only have a single SMX (I think). That may lead to performance artifacts that don’t occur with standalone GPUs.

micjan · November 16, 2017, 6:19pm

My question was answered in message #7 of the thread “Unexplained stalls in CUDA API calls - reproducer attached” in the Jetson TK1 section of the NVIDIA Developer Forums. Thanks.
