Unexplained stalls in CUDA API calls - reproducer attached

Hi,

I’m seeing an obscure problem when running CUDA compute on the Jetson TK1 (GK20A).

The problem manifests itself as random spikes in run-time. I’ve profiled with NVVP, collecting both kernel execution times and CUDA API profiling information.

The data I’ve got suggests nothing wrong with the kernel execution times, they fluctuate by 0.1-0.2ms tops. I’ve collected the data over sufficiently long sequences of frames.

I measure the per-frame run-time as below:

  CHECK_CUDA(cudaEventRecord(set_up.startEvent, 0));

  // do processing

  CHECK_CUDA(cudaEventRecord(set_up.stopEvent, 0));
  CHECK_CUDA(cudaEventSynchronize(set_up.stopEvent));

  CHECK_CUDA(cudaEventElapsedTime(&ms,
                                  set_up.startEvent,
                                  set_up.stopEvent));

I’m fixing the CPU and GPU clocks to maximum, as in the below script:

#!/bin/bash

echo "Stopping Xorg"

service lightdm stop

echo "Setting GPU clock"

echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate

echo "Setting CPU clock"

echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

export CPU0_ONLINE=$(cat /sys/devices/system/cpu/cpu0/online)
export CPU1_ONLINE=$(cat /sys/devices/system/cpu/cpu1/online)
export CPU2_ONLINE=$(cat /sys/devices/system/cpu/cpu2/online)
export CPU3_ONLINE=$(cat /sys/devices/system/cpu/cpu3/online)

if [ "$CPU0_ONLINE" -ne "1" ]
then
	echo 1 > /sys/devices/system/cpu/cpu0/online
fi

if [ "$CPU1_ONLINE" -ne "1" ]
then
	echo 1 > /sys/devices/system/cpu/cpu1/online
fi

if [ "$CPU2_ONLINE" -ne "1" ]
then
	echo 1 > /sys/devices/system/cpu/cpu2/online
fi

if [ "$CPU3_ONLINE" -ne "1" ]
then
	echo 1 > /sys/devices/system/cpu/cpu3/online
fi

echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1530000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed

export GPU_CLOCK=$(cat /sys/kernel/debug/clock/override.gbus/rate)
echo "GPU clock set to $GPU_CLOCK"

export CPU_CLOCK=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed)
echo "CPU clock set to $CPU_CLOCK"

export CPU0_ONLINE=$(cat /sys/devices/system/cpu/cpu0/online)
export CPU1_ONLINE=$(cat /sys/devices/system/cpu/cpu1/online)
export CPU2_ONLINE=$(cat /sys/devices/system/cpu/cpu2/online)
export CPU3_ONLINE=$(cat /sys/devices/system/cpu/cpu3/online)

echo "CPU0 online $CPU0_ONLINE"
echo "CPU1 online $CPU1_ONLINE"
echo "CPU2 online $CPU2_ONLINE"
echo "CPU3 online $CPU3_ONLINE"

service lightdm status

I make use of pinned CPU / GPU shared memory when processing, but the majority of the load is on the GPU. The GPU writes out its results to the shared memory, and then I access them from the CPU.

My observation is I need to call one of the CUDA API synchronisation functions so that the CPU / GPU shared memory gets synced properly. Otherwise, I see incorrect contents when accessing the memory from the CPU after the GPU has written out to it.
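For context, a minimal sketch of the pinned mapped-memory arrangement described above. CHECK_CUDA is the error-checking macro from earlier; BUF_SIZE, out_buffer, and my_kernel are illustrative names, not my actual code:

```cuda
// mapped pinned memory generally requires the map-host flag to be set
// before the CUDA context is created
CHECK_CUDA(cudaSetDeviceFlags(cudaDeviceMapHost));

// allocate pinned host memory that is also mapped into the GPU address space
int* out_buffer = NULL;
CHECK_CUDA(cudaHostAlloc((void**) &out_buffer, BUF_SIZE * sizeof(int),
                         cudaHostAllocMapped));

// obtain the device-side pointer for the same physical memory
int* d_out_buffer = NULL;
CHECK_CUDA(cudaHostGetDevicePointer((void**) &d_out_buffer, out_buffer, 0));

// the GPU writes its results through the device-side pointer
my_kernel<<<blocks, threads>>>(d_out_buffer);

// the CPU must not read out_buffer until a synchronisation call returns
CHECK_CUDA(cudaDeviceSynchronize());
// out_buffer contents are now valid on the CPU
```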

At first, I had a simple arrangement where all kernels were executed in the default stream, and just before the CPU was to access the shared memory with the results output by the GPU, I’d call cudaDeviceSynchronize. I found that on rare occasions, cudaDeviceSynchronize would randomly stall for up to 4ms.

The same would happen for me if I used cudaEventSynchronize.

I then rearranged my processing to make use of streams. Three of the kernels I need to run can be run concurrently. They all need to wait for data output from another kernel first, though. So the current arrangement I have is:

  • one kernel does the first stage of processing in stream 0
  • three kernels get submitted each to its own stream, each with a cudaStreamWaitEvent dependency on stream 0 being done with the first kernel
  • CPU then waits for each of the three kernels with cudaStreamSynchronize and then proceeds to access the shared memory to which the three kernels have written
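A sketch of that arrangement; the stream, event, and kernel names here are illustrative, not my actual code:

```cuda
// first-stage kernel in the default stream, followed by an event
stage1_kernel<<<grid, block>>>(d_in, d_mid);
CHECK_CUDA(cudaEventRecord(stage1_done, 0));

// three independent kernels, each in its own stream, each gated on stage 1
CHECK_CUDA(cudaStreamWaitEvent(stream_a, stage1_done, 0));
kernel_a<<<grid, block, 0, stream_a>>>(d_mid, out_a);
CHECK_CUDA(cudaStreamWaitEvent(stream_b, stage1_done, 0));
kernel_b<<<grid, block, 0, stream_b>>>(d_mid, out_b);
CHECK_CUDA(cudaStreamWaitEvent(stream_c, stage1_done, 0));
kernel_c<<<grid, block, 0, stream_c>>>(d_mid, out_c);

// the CPU waits for all three before touching the shared output buffers
CHECK_CUDA(cudaStreamSynchronize(stream_a));
CHECK_CUDA(cudaStreamSynchronize(stream_b));
CHECK_CUDA(cudaStreamSynchronize(stream_c));
```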

Strangely, in this arrangement, the stall moved to cudaLaunch. I found on rare occasions, cudaLaunch would stall for up to 11ms!

I’ve now added calls to __threadfence_system() at the end of all my kernels and create the streams with cudaStreamDefault rather than with the cudaStreamNonBlocking flag. That seems to be helping so far. However, I still don’t know what the problem is.
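For clarity, the difference in stream creation is just the flag passed to cudaStreamCreateWithFlags (the stream variable here is illustrative):

```cuda
cudaStream_t stream;
// what I use now: the stream synchronises with the default stream 0
CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamDefault));
// what I used before: the stream does not synchronise with stream 0
// CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
```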

The only similar topic on the forums I could find online was https://devtalk.nvidia.com/default/topic/523698/strange-cudalaunch-stall-in-nv-visual-profiler/ but I see the run-time spikes when not profiling too. Plus, the CUDA runtime version I’ve got on the TK1 is 6.5.

Any clues please?

A correction: the scaling_setspeed line in the above script should read:

echo 2065500 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed

I’ve written a standalone reproducer which successfully triggers the problem while doing no processing at all. It simply launches a kernel for a single frame (processing unit), and then either copies from the GPU memory (scenario 1) or synchronises the pinned shared memory (scenario 2) so that the CPU can do something with what the GPU generated.

For each frame (processing unit), I measure the run-time both using CUDA API and Linux system API. The measurement includes both the kernel run-time and either the memcpy operation or the sync operation, depending on the scenario.
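A sketch of the per-frame measurement, combining the CUDA event pair from earlier with clock_gettime; the buffer and event names are illustrative:

```cuda
struct timespec t0, t1;
float gpu_ms = 0.0f;

clock_gettime(CLOCK_MONOTONIC, &t0);         // Linux system clock
CHECK_CUDA(cudaEventRecord(startEvent, 0));  // CUDA event on the GPU timeline

test_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(d_buffer);
// scenario 1: copy the results back ...
CHECK_CUDA(cudaMemcpy(h_buffer, d_buffer, size, cudaMemcpyDeviceToHost));
// ... or, scenario 2: just synchronise the pinned shared memory
// CHECK_CUDA(cudaDeviceSynchronize());

CHECK_CUDA(cudaEventRecord(stopEvent, 0));
CHECK_CUDA(cudaEventSynchronize(stopEvent));
clock_gettime(CLOCK_MONOTONIC, &t1);

CHECK_CUDA(cudaEventElapsedTime(&gpu_ms, startEvent, stopEvent));
long cpu_us = (t1.tv_sec - t0.tv_sec) * 1000000L
            + (t1.tv_nsec - t0.tv_nsec) / 1000L;
```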

Following https://elinux.org/Jetson/Performance and https://devtalk.nvidia.com/default/topic/966640/jetson-tk1/jetson-tk1-cpu-performance/:
I remove the ondemand governor from the init scripts with

sudo update-rc.d -f ondemand remove

I fix the CPU, the GPU, and the memory clocks (I only recently found out the memory clock needs to be fixed separately)

sudo ./perf_set_up.sh

I verify all clocks are indeed fixed and monitor system load while running the reproducer with

sudo ./tegrastats

I build the reproducer for the Jetson TK1 target using the Makefile I’ve written, simply

TARGET_ARCH=armv7a make

I then run the reproducer on the Jetson TK1 and see output similar to below:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 

Running load1

        Initialising
        Disparity problem: 3.538 frame 437
                CUDA processing time: 5.293ms CPU time: 5377us
        Disparity problem: 4.957 frame 1072
                CUDA processing time: 6.712ms CPU time: 6797us

... more similar lines follow ...

        Processed 10000 frames, time avg: 2.486 min: 1.732 max: 6.919 disparity: 5.187
        Total run time: 00:00:25

Running load2

        Initialising
        Disparity problem: 3.034 frame 3727
                CUDA processing time: 3.873ms CPU time: 3905us
        Disparity problem: 3.478 frame 9157
                CUDA processing time: 4.317ms CPU time: 4402us
        Disparity problem: 3.461 frame 9600
                CUDA processing time: 4.299ms CPU time: 4385us
        Processed 10000 frames, time avg: 1.882 min: 0.839 max: 4.317 disparity: 3.478
        Total run time: 00:00:19
ubuntu@tegra-ubuntu:~/test$

My observations from running the reproducer are:

  • increasing the value for LOOPS_INNER increases the rate at which the problem occurs,
  • using pinned shared memory shows the problem less often, but shows it nevertheless,
  • with LOOPS_INNER set to 1 in the 2nd scenario (using pinned shared memory), the problem tends to disappear.

reproducer.zip (3.63 KB)

Please note I’ve profiled both reproducer scenarios with NVVP, and I observe that

  • in the first scenario, CUDA stalls at cudaMemcpy,
  • in the second scenario, CUDA stalls at cudaDeviceSynchronize,
  • all the kernel invocations have consistent run-times.



Hi,

Pinned memory doesn’t guarantee fast performance.

It’s recommended to use unified memory.
Please check this comment:
https://devtalk.nvidia.com/default/topic/1014483/jetson-tx2/zero-copy-access-cuda-pipeline/post/5170222/#5170222

Hi AastaLLL,

Thanks for your comments and suggestions. I haven’t tried using unified memory before. Would it work on the Jetson TK1, though? From the Programming Guide, the requirements are:

  • a GPU with SM architecture 3.0 or higher (Kepler class or newer)
  • a 64-bit host application and non-embedded operating system (Linux, Windows, macOS)

The TK1 is running CUDA runtime 6.5 / capability 3.2 on Linux.

I can see how the 64-bit host application requirement would be no problem on the TX1 / TX2.

However, the TK1 has a Cortex-A15 CPU, so the 64-bit host application requirement would not be met.

Hi,

Sorry for the mistake: unified memory is not available on the TK1.
If you are using pinned memory, a stall is possible due to the memory controller.

Hi AastaLLL,

From what I can see, the Jetson TK1 actually does support unified memory, even though it has a 32-bit CPU.

I’ve verified this first by modifying the CUDA deviceQuery sample by adding the below line:

printf("  Device supports Unified / Managed Memory:      %s\n", deviceProp.managedMemory ? "Yes" : "No");

(CUDA documentation says the managedMemory property stands for “Device supports allocating managed memory on this system”.)

Running this, I got a “Yes” in the output:

./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.2
  ...
  Device supports Unified Addressing (UVA):      Yes
  Device supports Unified / Managed Memory:      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  ...

And second using cudaDeviceGetAttribute to check cudaDevAttrManagedMemory:

int attrManaged = 0;
checkCuda(cudaDeviceGetAttribute(&attrManaged, cudaDevAttrManagedMemory, 0));
printf("Device supports Unified / Managed Memory: %s\n", attrManaged ? "Yes" : "No");

(CUDA documentation says the cudaDevAttrManagedMemory attribute stands for “Device can allocate managed memory on this system”.)

Running this, I got a “Yes” again in the output.

I’ve then modified the test program I’d attached earlier to this ticket, adding a third scenario:

void load3()
{
  if (unified_buffer == NULL) {
    printf("\tInitialising ");
    checkCuda(cudaMallocManaged((void**) &unified_buffer, NUM_THREADS * sizeof(int),
                                cudaMemAttachGlobal));
    printf("OK\n");
  }
  for (uint32_t j = 0; j < LOOPS_INNER; j++) {
    test_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(unified_buffer);
    // sync shared mem between GPU and CPU
    checkCuda(cudaDeviceSynchronize());
    // do sth with it on the CPU
    unified_buffer[0]++;
  }
}

However, the results I get with unified memory are no better in terms of run-time stability. In fact, they are worse than the ones I get with pinned+mapped memory.

Sample output I get with the test program:

ubuntu@tegra-ubuntu:~/test$ ./reproducer 
Device supports Unified / Managed Memory: Yes

Running scenario 1 - copying with cudaMemcpy (cudaMemcpyDeviceToHost)

	Initialising OK
	Disparity problem: 4.684 frame 1414
		CUDA processing time: 5.872ms CPU time: 5966us
	Disparity problem: 4.415 frame 2366
		CUDA processing time: 5.603ms CPU time: 5688us
	Disparity problem: 4.557 frame 3318
		CUDA processing time: 5.745ms CPU time: 5830us
	Disparity problem: 4.634 frame 4270
		CUDA processing time: 5.821ms CPU time: 5906us
	Disparity problem: 4.985 frame 5222
		CUDA processing time: 6.173ms CPU time: 6257us
	Disparity problem: 4.888 frame 6174
		CUDA processing time: 6.076ms CPU time: 6159us
	Disparity problem: 4.620 frame 7126
		CUDA processing time: 5.808ms CPU time: 5892us
	Disparity problem: 3.131 frame 8037
		CUDA processing time: 4.305ms CPU time: 4391us
	Disparity problem: 4.385 frame 8078
		CUDA processing time: 5.560ms CPU time: 5607us
	Disparity problem: 4.428 frame 9030
		CUDA processing time: 5.516ms CPU time: 5563us
	Disparity problem: 4.727 frame 9982
		CUDA processing time: 5.814ms CPU time: 5900us
	Processed 10000 frames, time avg: 1.635 min: 1.088 max: 6.173 disparity: 5.085
	Total run time: 00:00:17

Running scenario 2 - no copying, using pinned mapped memory (cudaHostAlloc with cudaHostAllocMapped)

	Initialising OK
	Disparity problem: 13.135 frame 4168
		CUDA processing time: 13.707ms CPU time: 13797us
	Disparity problem: 4.313 frame 6870
		CUDA processing time: 4.885ms CPU time: 5002us
	Processed 10000 frames, time avg: 1.085 min: 0.572 max: 13.707 disparity: 13.135
	Total run time: 00:00:11

Running scenario 3 - no copying, using unified / managed memory (cudaMallocManaged with cudaMemAttachGlobal)

	Initialising OK
	Disparity problem: 4.011 frame 463
		CUDA processing time: 5.910ms CPU time: 5999us
	Disparity problem: 4.624 frame 939
		CUDA processing time: 5.854ms CPU time: 5942us
	Disparity problem: 4.278 frame 1415
		CUDA processing time: 5.506ms CPU time: 5563us
	Disparity problem: 4.919 frame 1891
		CUDA processing time: 6.148ms CPU time: 6236us
	Disparity problem: 4.083 frame 2367
		CUDA processing time: 5.310ms CPU time: 5367us
	Disparity problem: 4.714 frame 2843
		CUDA processing time: 5.941ms CPU time: 6031us
	Disparity problem: 5.034 frame 3319
		CUDA processing time: 6.261ms CPU time: 6351us
	Disparity problem: 4.898 frame 3795
		CUDA processing time: 6.125ms CPU time: 6211us
	Disparity problem: 4.835 frame 4271
		CUDA processing time: 6.063ms CPU time: 6154us
	Disparity problem: 4.928 frame 4747
		CUDA processing time: 6.155ms CPU time: 6246us
	Disparity problem: 4.907 frame 5223
		CUDA processing time: 6.134ms CPU time: 6224us
	Disparity problem: 4.883 frame 5699
		CUDA processing time: 6.111ms CPU time: 6202us
	Disparity problem: 3.739 frame 6175
		CUDA processing time: 4.965ms CPU time: 5019us
	Disparity problem: 3.102 frame 6632
		CUDA processing time: 4.328ms CPU time: 4418us
	Disparity problem: 4.687 frame 6651
		CUDA processing time: 5.913ms CPU time: 6002us
	Disparity problem: 7.595 frame 6903
		CUDA processing time: 8.821ms CPU time: 8913us
	Disparity problem: 4.216 frame 7127
		CUDA processing time: 5.443ms CPU time: 5512us
	Disparity problem: 4.915 frame 7603
		CUDA processing time: 6.142ms CPU time: 6233us
	Disparity problem: 5.142 frame 8079
		CUDA processing time: 6.367ms CPU time: 6457us
	Disparity problem: 4.078 frame 8555
		CUDA processing time: 5.303ms CPU time: 5358us
	Disparity problem: 4.936 frame 9031
		CUDA processing time: 6.161ms CPU time: 6252us
	Disparity problem: 4.663 frame 9507
		CUDA processing time: 5.888ms CPU time: 5976us
	Disparity problem: 3.826 frame 9983
		CUDA processing time: 5.052ms CPU time: 5108us
	Processed 10000 frames, time avg: 1.826 min: 1.225 max: 8.821 disparity: 7.596
	Total run time: 00:00:19
Freeing pinned shared mem
Freeing unified mem

I’ve attached the updated reproducer source code.

In view of the above, none of the three available mechanisms allows memory to be transferred / shared from the GPU to the CPU reliably enough for real-time applications. Would you agree with this conclusion?
reproducer2.zip (3.86 KB)

Hi,

Thanks for your feedback.
I will check your use-case and update information to you later.

Hi,

Sorry for the late reply.

From comment #8, the scenario 3 execution time looks as stable as scenario 1.
May I know more about the issue that concerns you?

Thanks.

Hi AastaLLL,

I think you misunderstand.

The run-times in scenario 3 are similar to scenario 1, but neither of them is stable. Please look at the line from comment #8 as an example:

Processed 10000 frames, time avg: 1.635 min: 1.088 max: 6.173 disparity: 5.085

This shows the same workload (a single frame) can take anything between 1.088ms and 6.173ms, a disparity of 5.085ms. The run-time measurement includes both the processing on the GPU and the transfer of memory between the GPU and the CPU.

What is more, you’ve mentioned you’d expect the performance of unified memory to be better than pinned memory but in my experiments it’s as bad as using cudaMemcpy.

On my side, using pinned+mapped memory showed less frequent, but bigger, run-time spikes. Please see the line from comment #8 below:

Processed 10000 frames, time avg: 1.085 min: 0.572 max: 13.707 disparity: 13.135

This means there was a difference of 13.135ms between the fastest frame and the slowest frame.

Please note: all frames are the same workload - there is no difference in the amount of processing required.

For my use-case, I need to efficiently share the output of GPU computations for further post-processing on the CPU. I have a hard real-time requirement, which means run-time spikes above 3ms are not acceptable.

Hi,

Sorry for the late reply.
After a quick check of your source, there is something to confirm first:

__global__ void test_kernel(unsigned int* y_global_hist)
{
    // do nothing in particular
    y_global_hist[threadIdx.x] = 0;
}

If you launch 128 blocks concurrently, numerous CUDA threads access the same buffer locations.
Is this what you want?

Thanks.

Hi AastaLLL,

That’s not what I want and that’s not what I do in my use case. But that’s not the point of this ticket either.

You can remove that line and leave the kernel entirely empty and the problem will be the same.

Have you tried doing that?

Hi,

Usually, global memory access is the bottleneck of kernel execution, especially concurrent memory access.

If your issue still occurs without memory access, the stall may come from GPU scheduling.
Please check this document to tune Kepler performance:
http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html

Thanks.

Hi AastaLLL,

Previously, our implementation ran part of the processing on the GPU and then shared the intermediate results for post-processing on the CPU. This exhibited the same run-time disparity problem as the attached reproducer: the time required to process a single frame could jump up and down by anywhere between 4-11ms.

Currently, I’ve restructured the processing code to run all processing on the GPU only, to eliminate any intermediate memory transfers between the GPU and the CPU. No post-processing is done on the CPU now. With this rearrangement, I only see fluctuations of 2-3ms maximum.

I think this proves the problem does lie with using shared memory and cannot be attributed solely to kernel scheduling. We don’t queue up kernel launches in our application indefinitely: once each frame is processed, the frame buffer gets copied back to the CPU memory, and, as I know from the CUDA documentation, a CUDA memcpy performs an implicit device synchronisation.

I accept that I should later look into the CUDA occupancy calculator from the document you pointed to for further improvements, but I think you can appreciate that the biggest source of run-time instability in our application has been CPU / GPU memory sharing.

Hi,

From your last comment, could we conclude that the maximal disparity now is 3 ms rather than 11ms?

Here is a tutorial for unified memory:
https://www.youtube.com/watch?v=1rchhzTBqKk&feature=youtu.be&t=580

The video shows how the GPU driver handles a unified-memory buffer transfer between the CPU and GPU. Whole pages of data are transferred even though you only access a single int.

So, for a CUDA 6.5 user, it’s not recommended to alternate CPU and GPU accesses to the same buffer.
A better implementation is to assign data via CPU, process data via GPU, read results via CPU.
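A sketch of that pattern with illustrative names: the CPU touches the managed buffer only before the kernel is launched and after the synchronisation returns, so there is exactly one hand-off in each direction per frame:

```cuda
// 1. assign data via the CPU (no concurrent GPU access)
for (int i = 0; i < N; i++)
  managed_buf[i] = input[i];

// 2. process via the GPU
process_kernel<<<grid, block>>>(managed_buf);
checkCuda(cudaDeviceSynchronize());

// 3. read the results via the CPU - no further GPU access to the buffer
int64_t result = managed_buf[0];
```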

Thanks.

Hi AastaLLL,

From your last comment, could we conclude that the maximal disparity now is 3 ms rather than 11ms?

Unfortunately, now that I’ve added more kernel invocations, the stability problem appears to have shifted from being due to CPU / GPU memory sharing (which I no longer use) to being due to perhaps unoptimised kernel scheduling.

Based on my analysis with NVVP, the problem is with kernels I’m using to sum arrays on the GPU. I’m using a simple reduction algorithm:

  • assume the array to be summed has length N which is a power of 2,
  • first, spawn a kernel of N/2 threads, each summing 2 elems,
  • second, spawn a kernel of N/4 threads, each summing 2 elems,
  • the last summation stops when only 2 elems to be summed are left.
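The launch ladder for this scheme can be sketched as below (kernel and buffer names are illustrative). Note the sketch folds with a stride of half the remaining length rather than adjacent pairs, so that within one launch no thread reads a slot that another thread writes:

```cuda
// one launch per level; the stream serialises the launches, so each level
// sees the previous level's results; thread t folds slot t + half into slot t
__global__ void fold_level(int64_t* d_array, unsigned int half)
{
  d_array[threadIdx.x] += d_array[threadIdx.x + half];
}

// host side: N is the power-of-2 array length
for (unsigned int half = N / 2; half > 0; half >>= 1) {
  fold_level<<<1, half>>>(d_array, half);
}
// after a synchronisation, d_array[0] holds the sum
```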

I need to sum quite a few arrays in my code, and each could be summed independently of one another.

I’ve tried two approaches: running all the summation kernels back to back in a single stream, and distributing them across 3 concurrent streams.

I chose the number of streams to be 3 following the recommendation on the page you’ve pointed me to: “GK20A devices of compute capability 3.2 limit the number of concurrent kernels to 4”.

However, no matter which approach I choose, the kernel scheduling may get spread out in time or not, randomly.

In addition, if I follow the parallel streams approach, it’s only the first few summation kernel invocations that actually run in parallel for me: the rest get mysteriously serialised.

Please refer to the attached screenshot. You can see cudaLaunch may stall randomly for 1-1.5ms at a time.

Also, you can see that even though all work submitted to the three streams (streams 13, 14, and 15 in the screenshot) is submitted in breadth-first fashion, the kernels still get serialised by the runtime.

Any clues please?
cudalaunch-stall.png

Hi,

For further investigating, do you mind to share the source code of sum algorithm with us?
Thanks.

Hi AastaLLL,

I’ve tried using an improved algorithm to do the histogram summation, modelled after one of the NVIDIA samples in 6_Advanced/reduction:

__global__ void sum_pof2_l_arr(int64_t* array_in,
                               int64_t* array_out)
{
  // do first level of reduction
  // while reading from global memory and writing to shared memory
  __shared__ int64_t data_in[HIST_NUM_BUCKETS_LVL7];

  // we should have HIST_NUM_BUCKETS_LVL7 threads
  // to reduce an array of size HIST_NUM_BUCKETS_LVL8;
  // a single uint4 load fetches two consecutive int64_t elements
  uint4 vals = *((uint4*) (array_in + threadIdx.x * 2));
  int64_t sum = *((int64_t*) &vals) + *(((int64_t*) &vals) + 1);

  data_in[threadIdx.x] = sum;
  __syncthreads();

  // do further reductions in shared memory using a strided access pattern:
  // thread t folds data_in[t + level] into data_in[t], so no thread reads a
  // location another thread writes within the same iteration (an interleaved
  // data_in[2t] + data_in[2t+1] pattern would race across warps)
  for (unsigned int level = HIST_NUM_BUCKETS_LVL6; level > 0; level >>= 1) {

    // now we have more threads than needed, so check
    if (threadIdx.x < level) {
      data_in[threadIdx.x] += data_in[threadIdx.x + level];
    }

    __syncthreads();

  }

  // write out the final result
  // array_out contains just a single element (length HIST_NUM_BUCKETS_LVL0)
  if (threadIdx.x == 0) {
    array_out[0] = data_in[0];
  }

}

This algorithm helps reduce the overall run-time, but the random stalls simply shift back to cudaLaunch, at least when running all the summation kernels in a single stream, one array at a time.

I’ll try running the summation kernels in 3 concurrent streams again, this time with the new summation algorithm, and see if that helps or not.

Hi AastaLLL,

From what I’ve checked, using the new summation kernel alone doesn’t fix the fluctuation problem. I’ve tried using 3 concurrent streams running the summation, but the run-time spikes were still there.

I’m going to try a different approach. In the meantime, if you have any other clues, please let me know.