I’m running an OpenCV function, BackgroundSubtractionMOG2, on the GPU so that my app can keep up with a 4K (3840x2160) 30fps video stream. It is keeping up but I’m looking for additional performance improvements so I ran nprof to see where the time was going in the GPU:
Upload of 8 MB video frame from CPU to GPU : 2.2 ms
Execution of MOG2 function on the GPU : 24.7 ms
Download of 8 MB video frame from GPU to CPU : 2.3 ms
Total of 29.2 ms
That all seems reasonable to me. I also used Python’s cProfile to see where the time is going in the Python code that invokes the GPU function:
Upload of 8 MB video frame from CPU to GPU : 3.4 ms
Execution of MOG2 function on the GPU : 1.0 ms
Download of 8 MB video frame from GPU to CPU : 24.8 ms
Total of 29.2 ms
The total time spent as reported by both profilers is the same: 29.2 ms. However the Python profiler reports much less time executing the MOG2 function and much more time downloading the result frame from GPU to CPU. I don’t understand why the two profilers give such different results. They’re both looking at the same operation, one from the CPU perspective and one from the GPU perspective.
This is my first time working with a GPU. Can anyone explain why I see very different profile results for the same operation?
More details below:
System
Jetson Xavier NX
Jetpack 4.4
L4T 32.4.3
OpenCV 4.4.0 with CUDA support
Python 3.6.9
Python code
bg_subtractor_gpu = cv2.cuda.createBackgroundSubtractorMOG2()
gray_frame_GpuMat = cv2.cuda_GpuMat(gray_frame)
motion_frame_GpuMat = bg_subtractor_gpu.apply(gray_frame_GpuMat, 0.05, None)
motion_frame = motion_frame_GpuMat.download()
nprof
==27068== NVPROF is profiling process 27068, command: python3 opencv-2.py
==27068== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: Programming Guide :: CUDA Toolkit Documentation
==27068== Profiling application: python3 opencv-2.py
==27068== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.77% 29.6935s 1201 24.724ms 4.5370ms 88.958ms void cv::cuda::device::mog2::mog2<bool=1, unsigned char, float>(cv::cuda::PtrStepSz, cv::cuda::PtrStep, cv::cuda::PtrStep, cv::cuda::PtrStepSz<unsigned char, float>, cv::cuda::PtrStep, cv::cuda::PtrStepSz<unsigned char, float>, float, float, float, cv::cuda::device::mog2::Constants const *)
7.85% 2.78115s 1201 2.3157ms 1.4187ms 14.071ms [CUDA memcpy DtoH]
7.57% 2.68295s 1202 2.2321ms 992ns 18.116ms [CUDA memcpy HtoD]
0.81% 287.93ms 1202 239.54us 179.84us 451.52us [CUDA memset]
API calls: 89.17% 36.5202s 2402 15.204ms 1.7786ms 92.838ms cudaMemcpy2D
6.27% 2.56713s 2406 1.0670ms 427.42us 74.589ms cudaMallocPitch
2.49% 1.01861s 2407 423.19us 179.10us 10.590ms cudaFree
0.78% 319.26ms 1 319.26ms 319.26ms 319.26ms cudaMalloc
0.57% 235.32ms 1201 195.94us 103.30us 8.9274ms cudaMemset2DAsync
0.32% 129.19ms 1201 107.57us 53.088us 7.9606ms cudaLaunchKernel
0.18% 73.468ms 2402 30.586us 10.847us 1.7866ms cudaStreamDestroy
0.17% 71.233ms 2402 29.655us 6.9760us 1.7260ms cudaStreamCreate
0.04% 17.129ms 1201 14.262us 9.4080us 503.32us cudaFuncSetCacheConfig
0.01% 3.2540ms 1201 2.7090us 1.4400us 109.44us cudaGetLastError
0.00% 517.69us 191 2.7100us 1.1840us 95.456us cuDeviceGetAttribute
0.00% 326.11us 1 326.11us 326.11us 326.11us cudaMemset2D
0.00% 96.831us 1 96.831us 96.831us 96.831us cudaMemcpyAsync
0.00% 36.320us 2 18.160us 10.112us 26.208us cuDeviceTotalMem
0.00% 18.304us 1 18.304us 18.304us 18.304us cuDriverGetVersion
0.00% 14.815us 1 14.815us 14.815us 14.815us cuInit
0.00% 14.079us 4 3.5190us 1.8240us 6.2080us cuDeviceGetCount
0.00% 12.992us 1 12.992us 12.992us 12.992us cudaGetDevice
0.00% 7.1360us 3 2.3780us 2.1120us 2.7200us cuDeviceGet
0.00% 5.9200us 2 2.9600us 2.4000us 3.5200us cuDeviceGetName
0.00% 4.1920us 2 2.0960us 2.0800us 2.1120us cuDeviceGetUuid
0.00% 1.7920us 1 1.7920us 1.7920us 1.7920us cudaGetDeviceCount
cProfile
Fri Jul 24 14:51:57 2020 file.prof
118559 function calls (115959 primitive calls) in 54.135 seconds
Ordered by: internal time
List reduced from 1003 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1201 29.741 0.025 29.741 0.025 {method ‘download’ of ‘cv2.cuda_GpuMat’ objects}
1201 5.111 0.004 5.111 0.004 {cvtColor}
3953 4.543 0.001 4.543 0.001 {built-in method time.sleep}
1201 4.073 0.003 4.073 0.003 {method ‘upload’ of ‘cv2.cuda_GpuMat’ objects}
1201 3.720 0.003 3.720 0.003 {erode}
1201 1.703 0.001 1.703 0.001 {threshold}
1201 1.243 0.001 1.243 0.001 {method ‘apply’ of ‘cv2.cuda_BackgroundSubtractorMOG2’ objects}
1 0.997 0.997 0.997 0.997 opencv-2.py:11(init)
1201 0.927 0.001 46.635 0.039 opencv-2.py:117(process_frame_gpu_1)
1 0.447 0.447 53.280 53.280 opencv-2.py:141(main)