I am seeing what seems like relatively long kernel launch times on the Jetson TX2. Whereas from other posts on this forum I would expect ~5 us kernel launch times, I am seeing > 40 us in my profiling.
I am profiling the execution of the function “testKernelLaunch()” which launches the following null kernel:
__global__ void nullKernel()
{
}
void
testKernelLaunch()
{
nullKernel<<<1, 1>>>();
cudaDeviceSynchronize();
nullKernel<<<1, 1>>>();
}
The output I get from nvprof is the following:
==7361== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 2.5930us 2 1.2960us 640ns 1.9530us nullKernel(void)
==7361== API calls:
Time(%) Time Calls Avg Min Max Name
99.92% 140.38ms 2 70.191ms 41.856us 140.34ms cudaLaunch
0.05% 65.472us 91 719ns 384ns 16.800us cuDeviceGetAttribute
0.01% 18.016us 2 9.0080us 2.7840us 15.232us cudaConfigureCall
0.01% 15.552us 1 15.552us 15.552us 15.552us cudaDeviceSynchronize
0.00% 5.7920us 1 5.7920us 5.7920us 5.7920us cuDeviceTotalMem
0.00% 4.5760us 3 1.5250us 640ns 2.4000us cuDeviceGetCount
0.00% 2.2080us 3 736ns 640ns 800ns cuDeviceGet
0.00% 1.3760us 1 1.3760us 1.3760us 1.3760us cuDeviceGetName
I assume that the initial kernel launch takes 140.34 ms because of the runtime initialization, but the subsequent kernel launches take more than 40 us. The function that I am trying to implement in Cuda runs in less than 1 ms on the CPU, so this kind of launch overhead for a kernel is pretty significant for my use case. Is there any way to reduce that launch time?
Hi simon472, have you tried running ~/jetson_clocks.sh script or using nvpmodel tool to maximize the clocks before launching your application?
Depending on the active power scaling mode and workload, the GPU may need to spin up frequency by default.
Hi dusty, thank you for your reply. I did try to sudo ~/jetson_clocks.sh, and confirmed that the GPU clock is running at 1.3 GHz by using both tegrastats which outputs:
RAM 856/7853MB (lfb 1538x4MB) CPU [1%@2032,0%@2034,0%@2034,0%@2035,0%@2034,0%@2035] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@41C MCPU@41C GPU@47C PLL@41C AO@39C Tboard@37C Tdiode@38C PMIC@100C thermal@40C VDD_IN 3513/3518 VDD_CPU 292/302 VDD_GPU 146/146 VDD_SOC 780/780 VDD_WIFI 326/326 VDD_DDR 1113/1113
or alternatively nvpmodel -q --verbose which outputs:
NVPM VERB: PARAM GPU: ARG MIN_FREQ: PATH /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/min_freq: REAL_VAL: 1300500000 CONF_VAL: 0
NVPM VERB: PARAM GPU: ARG MAX_FREQ: PATH /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/max_freq: REAL_VAL: 1300500000 CONF_VAL: 2147483647
However, I still get ~35 us kernel launch times:
==11452== Profiling application: ./tester
==11452== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 2.2720us 2 1.1360us 512ns 1.7600us nullKernel(void)
==11452== API calls:
Time(%) Time Calls Avg Min Max Name
99.94% 174.11ms 2 87.057ms 37.184us 174.08ms cudaLaunch
0.03% 60.224us 91 661ns 320ns 16.896us cuDeviceGetAttribute
0.01% 17.504us 2 8.7520us 2.7520us 14.752us cudaConfigureCall
0.01% 15.168us 1 15.168us 15.168us 15.168us cudaDeviceSynchronize
0.00% 5.7600us 1 5.7600us 5.7600us 5.7600us cuDeviceTotalMem
0.00% 4.7680us 3 1.5890us 608ns 2.4640us cuDeviceGetCount
0.00% 2.0800us 3 693ns 672ns 736ns cuDeviceGet
0.00% 1.5040us 1 1.5040us 1.5040us 1.5040us cuDeviceGetName
If that’s the performance I should expect from the TX2, that’s fine, I just want to know if that’s what others see as well or if by chance my system might be running slower for some reason.
Hi,
Here is the profiling result of my environment:
==6336== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 670.16us 1000 670ns 640ns 1.9850us nullKernel(void)
API calls: 95.48% 252.03ms 1000 252.03us 25.504us 221.02ms cudaLaunch
4.04% 10.666ms 1000 10.665us 8.8320us 68.127us cudaDeviceSynchronize
0.45% 1.1782ms 1000 1.1780us 864ns 57.471us cudaConfigureCall
0.02% 62.559us 94 665ns 320ns 18.687us cuDeviceGetAttribute
0.00% 5.8240us 1 5.8240us 5.8240us 5.8240us cuDeviceTotalMem
0.00% 4.8320us 3 1.6100us 608ns 2.8480us cuDeviceGetCount
0.00% 1.7920us 2 896ns 736ns 1.0560us cuDeviceGet
0.00% 1.0880us 1 1.0880us 1.0880us 1.0880us cuDeviceGetName
Thanks.