Hi everyone!
I’m working on a time-critical template matching project with OpenCV, but even with various optimizations (memory considerations, multi-threading) the template matching performance is disappointingly poor. So I took a closer look with nvprof and the Visual Profiler: the CUDA API calls take a really long time, in some cases multiple milliseconds, while the actual operations they launch seem to take anywhere from 2x to well over 10x less time (for example, a cudaMemcpy call with a duration of 1.403 ms where the actual memory copy takes 36.8 microseconds, roughly 38x less). The times also seem to vary randomly, with the full cv::gpu::matchTemplate call taking anywhere from 14 to 30 ms, usually around 16-20 ms.
As a side note, I am a CUDA beginner, so I could be missing something really obvious or misinterpreting the data. In case it makes a difference, I’m using JetPack 2.3.1 with the included OpenCV version 2.4. Our carrier board requires JetPack 2.3 for its drivers, so a newer version is not an option. I have, however, run a version of the program on JetPack 3 with a self-built OpenCV 3.4, both with and without multithreading, and the long and variable template matching times were apparent in all of these cases.
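To cross-check the 14-30 ms spread without the profiler attached, a host-side timer along the following lines could be used. This is only a minimal sketch: `CallStats`, `time_calls` and `print_stats` are made-up helper names, not part of OpenCV or CUDA, and the callable passed in is assumed to block until the GPU work has finished (for example by ending with cudaDeviceSynchronize()), otherwise only the launch overhead would be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>

// Wall-clock statistics for repeated calls, in milliseconds.
// Mirrors the Calls/Avg/Min/Max columns of the nvprof summary.
struct CallStats {
    int calls;
    double min_ms;
    double avg_ms;
    double max_ms;
};

// Runs `fn` `runs` times and records how long each call blocks the host.
// `fn` is a placeholder for whatever is being timed, e.g. a lambda that
// wraps the template matching call (hypothetical, not shown here).
CallStats time_calls(const std::function<void()>& fn, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    CallStats s{runs, 0.0, 0.0, 0.0};
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        s.min_ms = (i == 0) ? ms : std::min(s.min_ms, ms);
        s.max_ms = std::max(s.max_ms, ms);
        total_ms += ms;
    }
    s.avg_ms = total_ms / runs;
    return s;
}

// Prints one summary line per timed operation.
void print_stats(const char* name, const CallStats& s) {
    std::printf("%-24s calls=%d  min=%.3f ms  avg=%.3f ms  max=%.3f ms\n",
                name, s.calls, s.min_ms, s.avg_ms, s.max_ms);
}
```

A call such as `print_stats("matchTemplate", time_calls(run_match, 100));` would then give a min/avg/max line comparable to the profiler's, where `run_match` stands for a user-supplied function that performs one complete match.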
nvprof console output [“profile01.nvvp”]:
======== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 33.04%  4.18630s    750  5.5817ms  41.293us  3.78553s  cudaLaunch
 31.44%  3.98329s    649  6.1376ms  18.642us  3.78531s  cudaMallocPitch
 31.18%  3.95064s    350  11.288ms  39.575us  3.77569s  cudaMemcpy
  1.25%  158.14ms    550  287.52us  9.4770us  9.4428ms  cudaDeviceSynchronize
  0.86%  109.25ms   1000  109.25us  33.899us  884.49us  cudaLaunchKernel
  0.57%  72.675ms  16250  4.4720us  1.2490us  2.8661ms  cudaFuncSetCacheConfig
  0.57%  72.271ms    850  85.024us  1.9780us  1.6735ms  cudaFree
  0.23%  29.712ms    800  37.139us  5.9360us  1.6743ms  cudaBindTexture
  0.22%  27.577ms     50  551.54us  73.942us  2.6249ms  cudaMemset2D
  0.22%  27.534ms    150  183.56us  103.62us  933.91us  cudaGetDeviceProperties
  0.14%  17.373ms    100  173.73us  61.445us  1.4838ms  cudaMalloc
  0.09%  10.900ms    100  109.00us  46.657us  689.59us  cudaMemcpy2D
  0.06%  7.9731ms    750  10.630us  3.0200us  764.05us  cudaUnbindTexture
  0.03%  4.4189ms    750  5.8910us     885ns  2.2307ms  cudaGetDevice
  0.03%  3.3924ms   1750  1.9380us     677ns  67.486us  cudaGetLastError
  0.02%  2.9515ms    750  3.9350us     520ns  825.60us  cudaConfigureCall
  0.02%  2.7211ms   3700     735ns     468ns  207.82us  cudaSetupArgument
  0.01%  1.6918ms   1000  1.6910us     521ns  49.260us  cudaPeekAtLastError
  0.00%  107.53us    200     537ns     312ns  1.8230us  cudaSetDoubleForDevice
  0.00%  41.863us     50     837ns     572ns  2.0300us  cudaCreateChannelDesc
Sometimes the cudaMallocPitch calls seem to be faster (nothing was changed, I just ran the program under nvprof again) [“profile02.nvvp”]:
======== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 45.39%  2.82749s    762  3.7106ms  42.438us  2.40143s  cudaLaunch
 41.65%  2.59468s    353  7.3504ms  35.096us  2.39360s  cudaMemcpy
  3.50%  218.11ms    661  329.96us  18.850us  3.1583ms  cudaMallocPitch
  2.89%  179.98ms    559  321.98us  12.393us  5.7166ms  cudaDeviceSynchronize
  1.72%  107.05ms   1020  104.95us  33.899us  878.76us  cudaLaunchKernel
  1.30%  81.289ms  16575  4.9040us  1.3010us  2.2484ms  cudaFuncSetCacheConfig
  1.19%  74.058ms    854  86.719us  2.0830us  1.3136ms  cudaFree
  0.62%  38.813ms    816  47.564us  5.8320us  1.9057ms  cudaBindTexture
  0.49%  30.431ms     51  596.69us  117.42us  1.3181ms  cudaMemset2D
  0.44%  27.214ms    153  177.87us  99.561us  2.6549ms  cudaGetDeviceProperties
  0.30%  18.640ms    102  182.74us  62.799us  1.0214ms  cudaMalloc
  0.17%  10.529ms    102  103.23us  53.061us  698.70us  cudaMemcpy2D
  0.14%  8.8204ms    765  11.529us  3.0200us  771.96us  cudaUnbindTexture
  0.06%  3.7410ms   1782  2.0990us     677ns  266.87us  cudaGetLastError
  0.05%  2.9464ms    763  3.8610us     885ns  1.3580ms  cudaGetDevice
  0.04%  2.7014ms   3756     719ns     468ns  207.92us  cudaSetupArgument
  0.02%  1.3655ms   1020  1.3380us     521ns  68.474us  cudaPeekAtLastError
  0.02%  1.2551ms    762  1.6470us     520ns  55.821us  cudaConfigureCall
  0.00%  110.91us    204     543ns     312ns  3.0200us  cudaSetDoubleForDevice
  0.00%  58.943us     51  1.1550us     624ns  16.298us  cudaCreateChannelDesc
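Comparing the Max column in the two summaries, a single call accounts for most of the total time of cudaLaunch and cudaMemcpy in both runs, and of cudaMallocPitch in the first run only. The arithmetic can be sketched as below, with the values copied from the two summaries above. `ApiRow`, `avg_without_max_ms` and `print_rows` are hypothetical names made up for illustration. The summary does not show why one call is that slow (one-time initialization is only a guess on my part).

```cpp
#include <cassert>
#include <cstdio>

// One row of the nvprof "API calls" summary, with times in seconds.
struct ApiRow {
    const char* name;
    double total_s;  // "Time" column
    int calls;       // "Calls" column
    double max_s;    // "Max" column
};

// Average duration in milliseconds once the single slowest call is removed.
double avg_without_max_ms(const ApiRow& r) {
    assert(r.calls > 1);
    return (r.total_s - r.max_s) / (r.calls - 1) * 1e3;
}

// Top three rows of profile01, copied from the summary.
const ApiRow kProfile01[] = {
    {"cudaLaunch",      4.18630, 750, 3.78553},
    {"cudaMallocPitch", 3.98329, 649, 3.78531},
    {"cudaMemcpy",      3.95064, 350, 3.77569},
};

// Top three rows of profile02, copied from the summary.
const ApiRow kProfile02[] = {
    {"cudaLaunch",      2.82749, 762, 2.40143},
    {"cudaMemcpy",      2.59468, 353, 2.39360},
    {"cudaMallocPitch", 0.21811, 661, 0.0031583},
};

// Prints the reported average next to the average without the slowest call.
void print_rows(const char* title, const ApiRow* rows, int n) {
    std::printf("%s\n", title);
    for (int i = 0; i < n; ++i) {
        const ApiRow& r = rows[i];
        std::printf("  %-16s avg=%.3f ms  avg without slowest call=%.3f ms\n",
                    r.name, r.total_s / r.calls * 1e3, avg_without_max_ms(r));
    }
}
```

With the slowest call removed, all six averages land between roughly 0.3 and 0.6 ms, and cudaMallocPitch comes out at about 0.31 ms in the first run versus about 0.33 ms in the second, so its apparent speed-up may largely be down to where that one slow call landed.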
The workflow of the program:
- Memory is allocated for 4 pointers with cudaMalloc()
- 4 cv::gpu::GpuMat objects are created that use the previously allocated pointers as their data pointers
- 2 images are copied with cudaMemcpy() (host to device) from the camera’s buffer to the first 2 GpuMats’ data pointers in thread “GetPics”
- The first image is trimmed with a cv::Rect to get the template in thread “GetPics”
- cv::gpu::matchTemplate() is called on the trimmed GpuMats in thread “TemplateMatch”
- When template matching is completed:
Any help/suggestions are greatly appreciated!
Edit: added the nvprof nvvp files.
profiles.zip (1.48 MB)