Nprof and cProfile performance profilers reporting very different durations for GPU function invocation and GPU->CPU download

I’m running an OpenCV function, BackgroundSubtractionMOG2, on the GPU so that my app can keep up with a 4K (3840x2160) 30fps video stream. It is keeping up but I’m looking for additional performance improvements so I ran nprof to see where the time was going in the GPU:

Upload of 8 MB video frame from CPU to GPU : 2.2 ms
Execution of MOG2 function on the GPU : 24.7 ms
Download of 8 MB video frame from GPU to CPU : 2.3 ms
Total of 29.2 ms

That all seems reasonable to me. I also used Python’s cProfile to see where the time is going in the Python code that invokes the GPU function:

Upload of 8 MB video frame from CPU to GPU : 3.4 ms
Execution of MOG2 function on the GPU : 1.0 ms
Download of 8 MB video frame from GPU to CPU : 24.8 ms
Total of 29.2 ms

The total time spent as reported by both profilers is the same: 29.2 ms. However the Python profiler reports much less time executing the MOG2 function and much more time downloading the result frame from GPU to CPU. I don’t understand why the two profilers give such different results. They’re both looking at the same operation, one from the CPU perspective and one from the GPU perspective.

This is my first time working with a GPU. Can anyone explain why I see very different profile results for the same operation?

More details below:


Jetson Xavier NX
Jetpack 4.4
L4T 32.4.3
OpenCV 4.4.0 with CUDA support
Python 3.6.9

Python code

bg_subtractor_gpu = cv2.cuda.createBackgroundSubtractorMOG2()
gray_frame_GpuMat = cv2.cuda_GpuMat(gray_frame)
motion_frame_GpuMat = bg_subtractor_gpu.apply(gray_frame_GpuMat, 0.05, None)
motion_frame =


==27068== NVPROF is profiling process 27068, command: python3
==27068== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at:
==27068== Profiling application: python3
==27068== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 83.77% 29.6935s 1201 24.724ms 4.5370ms 88.958ms void cv::cuda::device::mog2::mog2<bool=1, unsigned char, float>(cv::cuda::PtrStepSz, cv::cuda::PtrStep, cv::cuda::PtrStep, cv::cuda::PtrStepSz<unsigned char, float>, cv::cuda::PtrStep, cv::cuda::PtrStepSz<unsigned char, float>, float, float, float, cv::cuda::device::mog2::Constants const *)
7.85% 2.78115s 1201 2.3157ms 1.4187ms 14.071ms [CUDA memcpy DtoH]
7.57% 2.68295s 1202 2.2321ms 992ns 18.116ms [CUDA memcpy HtoD]
0.81% 287.93ms 1202 239.54us 179.84us 451.52us [CUDA memset]
API calls: 89.17% 36.5202s 2402 15.204ms 1.7786ms 92.838ms cudaMemcpy2D
6.27% 2.56713s 2406 1.0670ms 427.42us 74.589ms cudaMallocPitch
2.49% 1.01861s 2407 423.19us 179.10us 10.590ms cudaFree
0.78% 319.26ms 1 319.26ms 319.26ms 319.26ms cudaMalloc
0.57% 235.32ms 1201 195.94us 103.30us 8.9274ms cudaMemset2DAsync
0.32% 129.19ms 1201 107.57us 53.088us 7.9606ms cudaLaunchKernel
0.18% 73.468ms 2402 30.586us 10.847us 1.7866ms cudaStreamDestroy
0.17% 71.233ms 2402 29.655us 6.9760us 1.7260ms cudaStreamCreate
0.04% 17.129ms 1201 14.262us 9.4080us 503.32us cudaFuncSetCacheConfig
0.01% 3.2540ms 1201 2.7090us 1.4400us 109.44us cudaGetLastError
0.00% 517.69us 191 2.7100us 1.1840us 95.456us cuDeviceGetAttribute
0.00% 326.11us 1 326.11us 326.11us 326.11us cudaMemset2D
0.00% 96.831us 1 96.831us 96.831us 96.831us cudaMemcpyAsync
0.00% 36.320us 2 18.160us 10.112us 26.208us cuDeviceTotalMem
0.00% 18.304us 1 18.304us 18.304us 18.304us cuDriverGetVersion
0.00% 14.815us 1 14.815us 14.815us 14.815us cuInit
0.00% 14.079us 4 3.5190us 1.8240us 6.2080us cuDeviceGetCount
0.00% 12.992us 1 12.992us 12.992us 12.992us cudaGetDevice
0.00% 7.1360us 3 2.3780us 2.1120us 2.7200us cuDeviceGet
0.00% 5.9200us 2 2.9600us 2.4000us 3.5200us cuDeviceGetName
0.00% 4.1920us 2 2.0960us 2.0800us 2.1120us cuDeviceGetUuid
0.00% 1.7920us 1 1.7920us 1.7920us 1.7920us cudaGetDeviceCount


Fri Jul 24 14:51:57 2020

     118559 function calls (115959 primitive calls) in 54.135 seconds

Ordered by: internal time
List reduced from 1003 to 10 due to restriction <10>

ncalls tottime percall cumtime percall filename:lineno(function)
1201 29.741 0.025 29.741 0.025 {method ‘download’ of ‘cv2.cuda_GpuMat’ objects}
1201 5.111 0.004 5.111 0.004 {cvtColor}
3953 4.543 0.001 4.543 0.001 {built-in method time.sleep}
1201 4.073 0.003 4.073 0.003 {method ‘upload’ of ‘cv2.cuda_GpuMat’ objects}
1201 3.720 0.003 3.720 0.003 {erode}
1201 1.703 0.001 1.703 0.001 {threshold}
1201 1.243 0.001 1.243 0.001 {method ‘apply’ of ‘cv2.cuda_BackgroundSubtractorMOG2’ objects}
1 0.997 0.997 0.997 0.997
1201 0.927 0.001 46.635 0.039
1 0.447 0.447 53.280 53.280

I’ve been studying the source code and I believe this is what’s happening:

gray_frame_GpuMat = cv2.cuda_GpuMat(gray_frame)
The CPU creates a GpuMat and uploads the contents of the video frame to the GPU

motion_frame_GpuMat = bg_subtractor_gpu.apply(gray_frame_GpuMat, 0.05, None)
The CPU launches execution of the MOG2 kernel asynchronously and moves on. The GPU starts the calculations and blocks other activity, such as downloads, until the calculations are complete.

motion_frame =
The CPU issues the command to download the results from the GPU but that GPU activity is blocked because the kernel calculations aren’t complete yet. When the calculations are complete the GPU-CPU download begins

The Python profiler reports an unusually short duration for the MOG2 apply call because it launched the GPU calculations asynchronously and didn’t wait for the result. The Python profiler reports an unusually long duration for the GPU-CPU download because the download is blocked in the GPU until the calculations are complete.

I verified this by inserting a delay into my code to give the GPU time to complete the calculations and unblock before the CPU attempts the download:

bg_subtractor_gpu = cv2.cuda.createBackgroundSubtractorMOG2()
gray_frame_GpuMat = cv2.cuda_GpuMat(gray_frame)
motion_frame_GpuMat = bg_subtractor_gpu.apply(gray_frame_GpuMat, 0.05, None)
motion_frame =

Now the Python profiler tells the truth:

Upload of 8 MB video frame from CPU to GPU : 3.4 ms
Download of 8 MB video frame from GPU to CPU : 3.4 ms

Please run sudo jetson_clocks and profile with sudo tegrastats
Let’s also compare with tegrastats


I think I figured it out. Part of my problem was that I didn’t know anything about running code on a GPU. I’ve been working on that by reading “Hands-On GPU Programming with Python and CUDA”. I also found this post to be particularly useful:

The other part of my problem was that I didn’t know what to specify for the Stream argument when launching the mog2 kernel and so I specified None. This didn’t immediately blow up in my face so I concluded it must be a legitimate way of indicating that I didn’t want to specify a Stream. Now I can see that specifying None somehow caused 150 new streams to be defined and the mog2 kernel was launched on Stream 137. Since that wasn’t the Default stream the API didn’t call cudaDeviceSynchronize after launching the mog2 kernel and the CPU moved on and started the GPU->CPU download. This download didn’t start until the kernel was finished because the kernel code had blocked the device.

Once I learned that Stream_Null() was the proper Stream argument to mog2 the kernel was launched on the Default stream and cudaDeviceSynchronize was called. The results of the Python profiler and nvprof are now in agreement.

I’d attach the Visual Profiler files that illustrate both scenarios but I don’t see a way to attach files to my reply. If you want them send an e-mail address to