Improve kernel launch times on Jetson TX2?

I am seeing what seems like relatively long kernel launch times on the Jetson TX2. Whereas from other posts on this forum I would expect ~5 us kernel launch times, I am seeing > 40 us in my profiling.

I am profiling the execution of the function “testKernelLaunch()” which launches the following null kernel:

__global__ void nullKernel()

    nullKernel<<<1, 1>>>();
    nullKernel<<<1, 1>>>();

The output I get from nvprof is the following:

==7361== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.5930us         2  1.2960us     640ns  1.9530us  nullKernel(void)

==7361== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.92%  140.38ms         2  70.191ms  41.856us  140.34ms  cudaLaunch
  0.05%  65.472us        91     719ns     384ns  16.800us  cuDeviceGetAttribute
  0.01%  18.016us         2  9.0080us  2.7840us  15.232us  cudaConfigureCall
  0.01%  15.552us         1  15.552us  15.552us  15.552us  cudaDeviceSynchronize
  0.00%  5.7920us         1  5.7920us  5.7920us  5.7920us  cuDeviceTotalMem
  0.00%  4.5760us         3  1.5250us     640ns  2.4000us  cuDeviceGetCount
  0.00%  2.2080us         3     736ns     640ns     800ns  cuDeviceGet
  0.00%  1.3760us         1  1.3760us  1.3760us  1.3760us  cuDeviceGetName

I assume that the initial kernel launch takes 140.34 ms because of the runtime initialization, but the subsequent kernel launches take more than 40 us. The function that I am trying to implement in Cuda runs in less than 1 ms on the CPU, so this kind of launch overhead for a kernel is pretty significant for my use case. Is there any way to reduce that launch time?

Hi simon472, have you tried running ~/ script or using nvpmodel tool to maximize the clocks before launching your application?

Depending on the active power scaling mode and workload, the GPU may need to spin up frequency by default.

Hi dusty, thank you for your reply. I did try to sudo ~/, and confirmed that the GPU clock is running at 1.3 GHz by using both tegrastats which outputs:

RAM 856/7853MB (lfb 1538x4MB) CPU [1%@2032,0%@2034,0%@2034,0%@2035,0%@2034,0%@2035] EMC_FREQ 0%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@41C MCPU@41C GPU@47C PLL@41C AO@39C Tboard@37C Tdiode@38C PMIC@100C thermal@40C VDD_IN 3513/3518 VDD_CPU 292/302 VDD_GPU 146/146 VDD_SOC 780/780 VDD_WIFI 326/326 VDD_DDR 1113/1113

or alternatively nvpmodel -q --verbose which outputs:

NVPM VERB: PARAM GPU: ARG MIN_FREQ: PATH /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/min_freq: REAL_VAL: 1300500000 CONF_VAL: 0
NVPM VERB: PARAM GPU: ARG MAX_FREQ: PATH /sys/devices/17000000.gp10b/devfreq/17000000.gp10b/max_freq: REAL_VAL: 1300500000 CONF_VAL: 2147483647

However, I still get ~35 us kernel launch times:

==11452== Profiling application: ./tester
==11452== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.2720us         2  1.1360us     512ns  1.7600us  nullKernel(void)

==11452== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  174.11ms         2  87.057ms  37.184us  174.08ms  cudaLaunch
  0.03%  60.224us        91     661ns     320ns  16.896us  cuDeviceGetAttribute
  0.01%  17.504us         2  8.7520us  2.7520us  14.752us  cudaConfigureCall
  0.01%  15.168us         1  15.168us  15.168us  15.168us  cudaDeviceSynchronize
  0.00%  5.7600us         1  5.7600us  5.7600us  5.7600us  cuDeviceTotalMem
  0.00%  4.7680us         3  1.5890us     608ns  2.4640us  cuDeviceGetCount
  0.00%  2.0800us         3     693ns     672ns     736ns  cuDeviceGet
  0.00%  1.5040us         1  1.5040us  1.5040us  1.5040us  cuDeviceGetName

If that’s the performance I should expect from the TX2, that’s fine, I just want to know if that’s what others see as well or if by chance my system might be running slower for some reason.


Here is the profiling result of my environment:

==6336== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  670.16us      1000     670ns     640ns  1.9850us  nullKernel(void)
      API calls:   95.48%  252.03ms      1000  252.03us  25.504us  221.02ms  cudaLaunch
                    4.04%  10.666ms      1000  10.665us  8.8320us  68.127us  cudaDeviceSynchronize
                    0.45%  1.1782ms      1000  1.1780us     864ns  57.471us  cudaConfigureCall
                    0.02%  62.559us        94     665ns     320ns  18.687us  cuDeviceGetAttribute
                    0.00%  5.8240us         1  5.8240us  5.8240us  5.8240us  cuDeviceTotalMem
                    0.00%  4.8320us         3  1.6100us     608ns  2.8480us  cuDeviceGetCount
                    0.00%  1.7920us         2     896ns     736ns  1.0560us  cuDeviceGet
                    0.00%  1.0880us         1  1.0880us  1.0880us  1.0880us  cuDeviceGetName