I’m seeing an obscure problem when running CUDA compute on the Jetson TK1 (GK20A).
The problem manifests itself as random spikes in run-time. I’ve profiled with NVVP, collecting both kernel execution times and CUDA API profiling information.
The data I’ve got suggests nothing is wrong with the kernel execution times; they fluctuate by at most 0.1–0.2 ms. I’ve collected the data over sufficiently long sequences of frames.
I measure the per-frame run-time as below:
CHECK_CUDA(cudaEventRecord(set_up.startEvent, 0));
// do processing
CHECK_CUDA(cudaEventRecord(set_up.stopEvent, 0));
CHECK_CUDA(cudaEventSynchronize(set_up.stopEvent));
CHECK_CUDA(cudaEventElapsedTime(&ms, set_up.startEvent, set_up.stopEvent));
I’m fixing the CPU and GPU clocks to their maximum with the script below:
#!/bin/bash
echo "Stopping Xorg"
service lightdm stop

echo "Setting GPU clock"
echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate

echo "Setting CPU clock"
echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable

# Bring all four cores online
for cpu in 0 1 2 3; do
    if [ "$(cat /sys/devices/system/cpu/cpu$cpu/online)" -ne "1" ]; then
        echo 1 > /sys/devices/system/cpu/cpu$cpu/online
    fi
done

echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1530000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed

# Report the resulting state
echo "GPU clock set to $(cat /sys/kernel/debug/clock/override.gbus/rate)"
echo "CPU clock set to $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed)"
for cpu in 0 1 2 3; do
    echo "CPU$cpu online $(cat /sys/devices/system/cpu/cpu$cpu/online)"
done

service lightdm status
I make use of pinned CPU / GPU shared memory when processing, but the majority of the load is on the GPU. The GPU writes out its results to the shared memory, and then I access them from the CPU.
My observation is that I need to call one of the CUDA API synchronisation functions for the CPU / GPU shared memory to be synced properly. Otherwise, I see incorrect contents when accessing the memory from the CPU after the GPU has written to it.
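For reference, the zero-copy pattern I’m describing looks roughly like this. This is a minimal sketch, not my actual code: the kernel, buffer names and sizes are placeholders.

```cuda
// Sketch of the mapped (zero-copy) pinned-memory pattern described above.
// process, h_out, d_out and N are placeholder names, not the real code.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * i;  // GPU writes results into CPU / GPU shared memory
}

int main()
{
    const int N = 1024;
    float *h_out, *d_out;

    // Mapped pinned allocation: the GPU writes straight into host memory
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&h_out, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    process<<<(N + 255) / 256, 256>>>(d_out, N);

    // Without a synchronisation call here, the CPU can see stale contents
    cudaDeviceSynchronize();

    printf("h_out[1] = %f\n", h_out[1]);
    cudaFreeHost(h_out);
    return 0;
}
```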
At first, I had a simple arrangement where all kernels were executed in the default stream, and just before the CPU was to access the shared memory holding the GPU’s output, I’d call cudaDeviceSynchronize. On rare occasions, cudaDeviceSynchronize would randomly stall for up to 4 ms.
The same would happen for me if I used cudaEventSynchronize.
I then rearranged my processing to make use of streams. Three of the kernels I need to run can be run concurrently. They all need to wait for data output from another kernel first, though. So the current arrangement I have is:
- one kernel does the first stage of processing in stream 0
- three kernels get submitted each to its own stream, each with a cudaStreamWaitEvent dependency on stream 0 being done with the first kernel
- CPU then waits for each of the three kernels with cudaStreamSynchronize and then proceeds to access the shared memory to which the three have written
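The arrangement above can be sketched as follows. Again, this is an illustrative skeleton under my own placeholder names (stage1, stageA, s[], d_mid and so on are not my real identifiers), and for brevity the same dependent kernel is launched three times where in reality there are three different kernels.

```cuda
// Sketch of the stream / event dependency chain described above.
#include <cuda_runtime.h>

__global__ void stage1(float *mid) { mid[threadIdx.x] = (float)threadIdx.x; }
__global__ void stageA(const float *mid, float *out) { out[threadIdx.x] = mid[threadIdx.x]; }
// stageB / stageC would be analogous

int main()
{
    float *d_mid, *d_out;
    cudaMalloc(&d_mid, 256 * sizeof(float));
    cudaMalloc(&d_out, 256 * sizeof(float));

    cudaStream_t s[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreate(&s[i]);

    cudaEvent_t stage1Done;
    cudaEventCreateWithFlags(&stage1Done, cudaEventDisableTiming);

    // First stage runs in the default stream; record its completion
    stage1<<<1, 256>>>(d_mid);
    cudaEventRecord(stage1Done, 0);

    // Each dependent kernel waits on stage1 in its own stream
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(s[i], stage1Done, 0);
        stageA<<<1, 256, 0, s[i]>>>(d_mid, d_out);  // stageB / stageC in practice
    }

    // CPU waits for all three streams before reading the results
    for (int i = 0; i < 3; ++i)
        cudaStreamSynchronize(s[i]);
    return 0;
}
```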
Strangely, in this arrangement, the stall moved to cudaLaunch. I found on rare occasions, cudaLaunch would stall for up to 11ms!
I’ve now added calls to __threadfence_system() at the end of all my kernels, and I now create the streams with cudaStreamDefault rather than with the cudaStreamNonBlocking flag. That seems to be helping so far. However, I still don’t know what the problem is.
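For clarity, the fence goes at the very end of each kernel, after the last write to the mapped output buffer. A placeholder sketch (writeResults and shared_out are hypothetical names):

```cuda
// Placeholder kernel showing where __threadfence_system() was added.
__global__ void writeResults(float *shared_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        shared_out[i] = (float)i;  // last write to the CPU / GPU shared buffer

    // Flush the writes so they are visible to the host once the kernel ends
    __threadfence_system();
}
```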
The only similar topic I could find on the forums was https://devtalk.nvidia.com/default/topic/523698/strange-cudalaunch-stall-in-nv-visual-profiler/ but I see the run-time spikes when not profiling, too. Also, the CUDA runtime version I’ve got on the TK1 is 6.5.
Any clues please?