Extremely slow CUDA API calls?

Hi everyone!

I’m working on a time-critical template matching project with OpenCV, but even with various optimizations (memory considerations, multi-threading) the template matching performance was disappointingly poor. So I took a closer look with nvprof and the Visual Profiler, and the CUDA API calls are taking a really long time, in some cases multiple milliseconds, while the actual operations they launch take 2x to more than 10x less time (for example, a cudaMemcpy with a duration of 1.403 ms whose actual memory copy takes only 36.8 microseconds). The times also vary seemingly at random, with the full cv::gpu::matchTemplate call taking anywhere from 14-30 ms, usually around 16-20 ms.

As a sidenote, I am a CUDA beginner, so I could be missing something really obvious or misinterpreting the data. In case it makes a difference, I’m using JetPack 2.3.1 with the included OpenCV 2.4 (our carrier board requires JetPack 2.3 for its drivers, so a newer version is not an option). I have, however, run a version of the program on JetPack 3 with a self-built OpenCV 3.4, both with and without multithreading, and the issue of long and variable template matching times was apparent in all these cases.

nvprof console output [“profile01.nvvp”]:

======== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 33.04%  4.18630s       750  5.5817ms  41.293us  3.78553s  cudaLaunch
 31.44%  3.98329s       649  6.1376ms  18.642us  3.78531s  cudaMallocPitch
 31.18%  3.95064s       350  11.288ms  39.575us  3.77569s  cudaMemcpy
  1.25%  158.14ms       550  287.52us  9.4770us  9.4428ms  cudaDeviceSynchronize
  0.86%  109.25ms      1000  109.25us  33.899us  884.49us  cudaLaunchKernel
  0.57%  72.675ms     16250  4.4720us  1.2490us  2.8661ms  cudaFuncSetCacheConfig
  0.57%  72.271ms       850  85.024us  1.9780us  1.6735ms  cudaFree
  0.23%  29.712ms       800  37.139us  5.9360us  1.6743ms  cudaBindTexture
  0.22%  27.577ms        50  551.54us  73.942us  2.6249ms  cudaMemset2D
  0.22%  27.534ms       150  183.56us  103.62us  933.91us  cudaGetDeviceProperties
  0.14%  17.373ms       100  173.73us  61.445us  1.4838ms  cudaMalloc
  0.09%  10.900ms       100  109.00us  46.657us  689.59us  cudaMemcpy2D
  0.06%  7.9731ms       750  10.630us  3.0200us  764.05us  cudaUnbindTexture
  0.03%  4.4189ms       750  5.8910us     885ns  2.2307ms  cudaGetDevice
  0.03%  3.3924ms      1750  1.9380us     677ns  67.486us  cudaGetLastError
  0.02%  2.9515ms       750  3.9350us     520ns  825.60us  cudaConfigureCall
  0.02%  2.7211ms      3700     735ns     468ns  207.82us  cudaSetupArgument
  0.01%  1.6918ms      1000  1.6910us     521ns  49.260us  cudaPeekAtLastError
  0.00%  107.53us       200     537ns     312ns  1.8230us  cudaSetDoubleForDevice
  0.00%  41.863us        50     837ns     572ns  2.0300us  cudaCreateChannelDesc

Sometimes the cudaMallocPitch calls are faster (nothing was changed, I just ran nvprof on the program again) [“profile02.nvvp”]:

======== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 45.39%  2.82749s       762  3.7106ms  42.438us  2.40143s  cudaLaunch
 41.65%  2.59468s       353  7.3504ms  35.096us  2.39360s  cudaMemcpy
  3.50%  218.11ms       661  329.96us  18.850us  3.1583ms  cudaMallocPitch
  2.89%  179.98ms       559  321.98us  12.393us  5.7166ms  cudaDeviceSynchronize
  1.72%  107.05ms      1020  104.95us  33.899us  878.76us  cudaLaunchKernel
  1.30%  81.289ms     16575  4.9040us  1.3010us  2.2484ms  cudaFuncSetCacheConfig
  1.19%  74.058ms       854  86.719us  2.0830us  1.3136ms  cudaFree
  0.62%  38.813ms       816  47.564us  5.8320us  1.9057ms  cudaBindTexture
  0.49%  30.431ms        51  596.69us  117.42us  1.3181ms  cudaMemset2D
  0.44%  27.214ms       153  177.87us  99.561us  2.6549ms  cudaGetDeviceProperties
  0.30%  18.640ms       102  182.74us  62.799us  1.0214ms  cudaMalloc
  0.17%  10.529ms       102  103.23us  53.061us  698.70us  cudaMemcpy2D
  0.14%  8.8204ms       765  11.529us  3.0200us  771.96us  cudaUnbindTexture
  0.06%  3.7410ms      1782  2.0990us     677ns  266.87us  cudaGetLastError
  0.05%  2.9464ms       763  3.8610us     885ns  1.3580ms  cudaGetDevice
  0.04%  2.7014ms      3756     719ns     468ns  207.92us  cudaSetupArgument
  0.02%  1.3655ms      1020  1.3380us     521ns  68.474us  cudaPeekAtLastError
  0.02%  1.2551ms       762  1.6470us     520ns  55.821us  cudaConfigureCall
  0.00%  110.91us       204     543ns     312ns  3.0200us  cudaSetDoubleForDevice
  0.00%  58.943us        51  1.1550us     624ns  16.298us  cudaCreateChannelDesc

The workflow of the program:

  1. Memory is allocated for 4 pointers with cudaMalloc()
  2. 4 cv::gpu::GpuMat-s are created with the previously allocated pointers assigned as their data pointers
  3. 2 images are cudaMemcpy()-ed (host to device) from the camera’s buffer to the first 2 GpuMats’ data pointers in thread “GetPics”
  4. The first image is trimmed with a cv::Rect to get the template in thread “GetPics”
  5. cv::gpu::matchTemplate() is called on the trimmed GpuMats in thread “TemplateMatch”
  6. When template matching is completed:
  • cv::gpu::minMaxLoc is called in thread "Output" to get the result
  • Thread "GetPics" usleeps for ~9 ms, and then steps 2-6 are repeated for the other 2 GpuMats

Any help/suggestions are greatly appreciated!
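The steps above can be sketched with OpenCV 2.4’s gpu module roughly as follows. This is a minimal single-iteration sketch under assumptions: the image size, the cv::Rect, and the commented-out camera buffer copy are placeholders (the real program uses the Ximea API and 4 buffers), and error checking is omitted.

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cuda_runtime.h>

int main() {
    const int W = 1280, H = 1024;            // placeholder camera geometry

    // Step 1: allocate raw device memory (only one of the 4 buffers shown)
    unsigned char* d_img = 0;
    cudaMalloc(&d_img, W * H);

    // Step 2: wrap the device pointer in a GpuMat (no copy, no ownership)
    cv::gpu::GpuMat img(H, W, CV_8UC1, d_img, W);

    // Step 3: copy a camera frame host -> device, e.g.
    // cudaMemcpy(d_img, camera_buf, W * H, cudaMemcpyHostToDevice);

    // Step 4: trim out the template region (a view, not a copy)
    cv::gpu::GpuMat templ = img(cv::Rect(100, 100, 64, 64));

    // Steps 5-6: match, then read back the best location
    cv::gpu::GpuMat result;
    cv::gpu::matchTemplate(img, templ, result, CV_TM_CCORR_NORMED);
    double maxVal;
    cv::Point maxLoc;
    cv::gpu::minMaxLoc(result, 0, &maxVal, 0, &maxLoc);

    cudaFree(d_img);
    return 0;
}
```

Note that the GpuMat constructor in step 2 does not take ownership of the pointer, so cudaFree must still be called on the raw allocation.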

    Edit: added the nvprof .nvvp files.
    profiles.zip (1.48 MB)


    Do you maximize the system performance with jetson_clocks first?

    sudo jetson_clocks.sh

    More, does your application depend on OpenCV for image input/output?
    If not, we have several optimized camera/multimedia pipelines without memory copy, using GStreamer or Argus.
    Please check the L4T multimedia API in JetPack for details.



    Thanks for your reply. I forgot to mention, I did run sudo ~/jetson_clocks.sh before profiling. The threads are running at niceness -20 and at the highest thread priority. I have also disabled the X-server.

    The application does not depend on OpenCV for I/O, we have a Ximea xiQ USB3 camera and are using Ximea’s own API for camera input.

    I now realized that the program was constantly copying every new frame to a buffer on the TX1 (~0.3 MB every 2 ms), so I reworked it to capture only the exact 2 frames I need. This had no noticeable effect on the profiling results.
    I don’t think further optimizing the camera input to get rid of the camera -> RAM -> “GPU RAM” inefficiency will help much, because it takes up just a tiny fraction of the total loop time (a few milliseconds at most, while OpenCV’s template matching takes >15 ms for data that is already in GPU memory).

    I’m still confused as to why the API calls take so much longer than the operations they actually execute. For example:

      1.39%  377.42ms     10004  37.726us  22.395us  82.917us  [CUDA memcpy DtoD]
      1.27%  345.39ms     10002  34.531us  27.813us  64.010us  [CUDA memcpy HtoD]
      0.31%  85.225ms      5002  17.038us  14.479us  31.303us  void cv::gpu::device::transform_detail::transformSimple<unsigned char, float, cv::gpu::device::Convertor<unsigned char, float, float>, cv::gpu::device::WithOutMask>(cv::gpu::PtrStepSz<unsigned char>, cv::gpu::PtrStep<float>, float, unsigned char)
      0.16%  44.727ms     25006  1.7880us     572ns  10.261us  [CUDA memcpy DtoH]
      0.14%  37.024ms      5002  7.4010us  4.2180us  13.750us  [CUDA memset]
      0.11%  29.459ms      5001  5.8900us  3.5420us  13.697us  void minMaxLoc::kernel_pass_2<int=256, float>(float*, float, unsigned int*, float*, int)
    ======== API calls:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     34.65%  41.2658s     75028  550.01us  40.260us  2.31302s  cudaLaunch
     13.28%  15.8176s     65024  243.26us  16.354us  9.8535ms  cudaMallocPitch
     13.13%  15.6367s     55021  284.20us  8.6980us  3.1643ms  cudaDeviceSynchronize
     11.04%  13.1422s     35008  375.41us  33.906us  2.30092s  cudaMemcpy

    None of the memcpys takes longer than 40 microseconds on average, yet for the cudaMemcpy API call the average time (over 5000 OpenCV template matching loops) is roughly 10 times longer!

    I’ve also included an image of a cudaLaunch taking nearly 2 ms while seemingly nothing else is happening. It’s like this every loop. What’s happening during this time?

    Are these behaviors normal? To me as a beginner it looks like there is a lot of “waiting for nothing” going on…

    Not sure that disabling the X server is a good idea when using CUDA.
    Someone more experienced may confirm this or not, and give further details.

    I don’t know enough to help in any detail, but I can guarantee that a -20 nice level is asking for trouble. Because you are possibly (likely?) starving out something else to get it, it could end up making your call slower, not faster. An example: your process needs data from some other driver (e.g., USB or ethernet), but that driver has to wait until your process times out.

    If you run htop (“sudo apt-get install htop”) you will see a priority column (“PRI”) and a nice column (“NI”). Nice is what you have set; priority is what the scheduler is doing based on current activity. In a way this demonstrates that this isn’t a hard realtime OS and that the scheduler has some latitude. The detail to notice is that even when a process is given more priority (a more negative niceness), the value will rarely go more negative than -11. I see the performance-measuring tools themselves get a -20, but that is a special case of a process which won’t block. If you have to set your priority more negative than perhaps -5, you are more likely adding trouble than gaining speed.

    Most CUDA apps require X because the GPU driver goes through the X ABI. There are some cases of apps talking directly to the GPU, but those seem to be the exception and not the rule (I couldn’t tell you what is needed to do normal CUDA work without X; someone else may have the details for working in that layer). I could imagine that if the program doesn’t actually try to manipulate a display buffer, it may not need a rendering context (a “DISPLAY”). Are you sure X is not running (by default it will respawn if it dies)? Is Xorg missing from the output of:

    ps aux | egrep Xorg | egrep -v grep


    The average execution time of cudaLaunch is on the order of microseconds. It should be acceptable.

    The worst case may come from the first kernel launch.
    When the application starts running, the GPU code is physically sent to the GPU and launched there.
    Hence the first launch is slow.
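If first-launch initialization is the suspect, a common approach is to pay that cost once at startup, before any timed work. A minimal warm-up sketch under that assumption (it requires a CUDA-capable device, so it is not verified here):

```cpp
#include <cuda_runtime.h>

// Called once at startup, before the timed loop: forces CUDA context
// creation and module loading so the multi-second first-call cost
// (visible in the Max column of the profiles above) is not billed to
// the first "real" cudaMemcpy/cudaLaunch.
void cudaWarmUp() {
    cudaFree(0);              // a no-op that triggers context creation
    cudaDeviceSynchronize();  // block until initialization completes
}
```

With the warm-up in place, the Max column should drop toward the Avg, though it would not explain the steady-state per-call overhead discussed above.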