GStreamer CUDA Implementation: Low FPS, High cudaDeviceSynchronize Load

Hardware is a Jetson Nano A02 running JetPack 4.4, OpenCV 4.4, and CUDA 10.2.

Is the in-place filtering in the following example causing the lag?

I modified the example found here (Nano not using GPU with gstreamer/python. Slow FPS, dropped frames - #8 by DaneLLL) to use the 1.4 MP V4L2-based CSI camera I have. The GStreamer pipeline is:
<< "v4l2src device=/dev/video0 name=mysource ! " << "video/x-raw, format=BGRx, width="<< w <<",height="<< h <<" ! " << "nvvidconv name=myconv ! " << "video/x-raw(memory:NVMM), format=RGBA ! " << "nvoverlaysink ";

I can confirm it is running and using the GPU based on jtop. The stream itself is functioning correctly: when I remove the filter->apply(d_mat, d_mat) call, the stream plays correctly in real time.
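
For reference, the probe body containing the filter->apply call follows the EGL-mapping flow from the linked example. This is a simplified sketch; the globals, filter construction, and exact helper setup are assumptions rather than a verbatim copy of my code:

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafilters.hpp>
#include <gst/gst.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cuda.h>
#include <cudaEGL.h>
#include "nvbuf_utils.h"

// Globals set up during initialization (placeholder values).
static EGLDisplay egl_display;
static int w = 1280, h = 960;
static cv::Ptr<cv::cuda::Filter> filter;  // some filter operating on CV_8UC4 frames

static GstPadProbeReturn conv_src_pad_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
    GstBuffer *buffer = GST_PAD_PROBE_INFO_BUFFER(info);
    GstMapInfo map = {};
    gst_buffer_map(buffer, &map, GST_MAP_READ);

    // NVMM buffer -> dmabuf fd -> EGLImage -> CUDA-mapped frame.
    int dmabuf_fd = 0;
    ExtractFdFromNvBuffer((void *)map.data, &dmabuf_fd);
    EGLImageKHR egl_image = NvEGLImageFromFd(egl_display, dmabuf_fd);

    CUgraphicsResource resource = nullptr;
    cuGraphicsEGLRegisterImage(&resource, egl_image, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    CUeglFrame egl_frame;
    cuGraphicsResourceGetMappedEglFrame(&egl_frame, resource, 0, 0);
    cuCtxSynchronize();

    // Wrap the mapped RGBA plane as a GpuMat (no copy) and filter it in place.
    cv::cuda::GpuMat d_mat(h, w, CV_8UC4, egl_frame.frame.pPitch[0]);
    filter->apply(d_mat, d_mat);   // removing this line makes the stream play in real time

    cuCtxSynchronize();
    cuGraphicsUnregisterResource(resource);
    NvDestroyEGLImage(egl_display, egl_image);
    gst_buffer_unmap(buffer, &map);
    return GST_PAD_PROBE_OK;
}
```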

Using nvprof and comparing the filtered and unfiltered runs, I see that with the filter I spend 70% of my time in cudaDeviceSynchronize, versus about 5% spent in synchronization without it. What's really strange is that without the filter, GPU activities shows 100%, whereas with the filter GPU activities shows 0%. In both cases jtop shows high GPU load while the program is running.
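
My reading, which may be wrong, is that the cudaDeviceSynchronize time is mostly the CPU thread blocking while the filter kernel runs, and nvprof is simply not attributing that kernel under GPU activities, so the cost shows up in the API-call table instead (the ~10 ms average per cudaDeviceSynchronize call below would then be roughly the per-frame kernel time). One way I could check that directly is to time the apply() call with CUDA events; a hypothetical helper, not in my current code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafilters.hpp>

// Time a single in-place filter call with CUDA events (d_mat and filter as in
// the probe sketch above).
static void time_filter(cv::Ptr<cv::cuda::Filter> &filter, cv::cuda::GpuMat &d_mat)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    filter->apply(d_mat, d_mat);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("filter: %.2f ms/frame\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```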

I'm not really interested in this specific filter; I have other matrix operations (fiducial recognition) I would like to do in CUDA with OpenCV, and I am just using this as a test case. So if there is a better method for getting frames from GStreamer into CUDA, I can port to that.
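
One thing I may try, sketched below and not yet verified on the Nano: give the OpenCV CUDA calls an explicit cv::cuda::Stream, since with the default stream every apply() appears to end in a blocking synchronize. With a dedicated stream I could queue the filter plus my other GpuMat work and synchronize once per frame. The Gaussian filter here is only a stand-in for whatever operations I end up using:

```cpp
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafilters.hpp>

// Sketch: queue the per-frame CUDA work on an explicit stream and synchronize
// once per frame, instead of letting every default-stream call block on its own.
struct FrameProcessor
{
    cv::cuda::Stream stream;                       // non-default, asynchronous stream
    cv::Ptr<cv::cuda::Filter> filter =
        cv::cuda::createGaussianFilter(CV_8UC4, CV_8UC4, cv::Size(7, 7), 0);

    void process(cv::cuda::GpuMat &d_mat)
    {
        filter->apply(d_mat, d_mat, stream);       // queued asynchronously
        // ... other GpuMat work (e.g. fiducial pre-processing) on the same stream ...
        stream.waitForCompletion();                // block once, before the buffer is unmapped
    }
};
```

Whether that actually buys anything obviously depends on whether the bottleneck is the kernel itself or the per-call synchronization.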

nvprof output, NO filter:
==14232== Profiling application: ./gst_cv_gpumat
==14232== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.3020us 2 651ns 261ns 1.0410us [CUDA memcpy HtoD]
API calls: 41.33% 394.97ms 679 581.69us 10.678us 384.18ms cudaFree
28.05% 268.04ms 679 394.76us 310.00us 4.4273ms cuGraphicsEGLRegisterImage
14.91% 142.46ms 679 209.81us 193.02us 551.15us cuGraphicsUnregisterResource
10.42% 99.549ms 2 49.775ms 31.355us 99.518ms cudaMalloc
5.03% 48.084ms 1358 35.408us 24.011us 228.39us cuCtxSynchronize
0.20% 1.8908ms 679 2.7840us 1.9270us 26.406us cuGraphicsResourceGetMappedEglFrame
0.04% 335.73us 191 1.7570us 677ns 55.574us cuDeviceGetAttribute
0.03% 282.97us 2 141.49us 98.543us 184.43us cudaMemcpy2D
0.00% 30.208us 2 15.104us 14.166us 16.042us cuDeviceTotalMem
0.00% 8.6970us 4 2.1740us 1.4060us 3.2290us cuDeviceGetCount
0.00% 5.7300us 1 5.7300us 5.7300us 5.7300us cuInit
0.00% 5.4160us 3 1.8050us 1.4580us 2.2910us cuDeviceGet
0.00% 5.3130us 2 2.6560us 2.4480us 2.8650us cuDeviceGetName
0.00% 2.6040us 1 2.6040us 2.6040us 2.6040us cuDriverGetVersion
0.00% 2.4480us 2 1.2240us 1.1980us 1.2500us cuDeviceGetUuid

nvprof output, WITH filter:
0.00% 1.4060us 2 703ns 312ns 1.0940us [CUDA memcpy HtoD]
API calls: 70.87% 8.17729s 761 10.745ms 9.2522ms 13.475ms cudaDeviceSynchronize
22.45% 2.59078s 762 3.4000ms 72.814us 2.49633s cudaLaunchKernel
3.05% 352.17ms 381 924.32us 11.354us 346.05ms cudaFree
1.38% 159.36ms 381 418.26us 322.30us 4.4731ms cuGraphicsEGLRegisterImage
0.96% 110.61ms 2 55.304ms 28.958us 110.58ms cudaMalloc
0.82% 94.885ms 380 249.70us 211.46us 439.12us cuGraphicsUnregisterResource
0.27% 31.163ms 761 40.950us 26.927us 382.77us cuCtxSynchronize
0.14% 16.335ms 1 16.335ms 16.335ms 16.335ms cudaMallocPitch
0.02% 2.5968ms 762 3.4070us 1.9270us 6.7190us cudaGetDevice
0.01% 1.1229ms 381 2.9470us 1.8750us 52.865us cuGraphicsResourceGetMappedEglFrame
0.01% 1.0158ms 762 1.3330us 833ns 62.866us cudaGetLastError
0.00% 461.40us 285 1.6180us 573ns 51.615us cuDeviceGetAttribute
0.00% 287.30us 2 143.65us 93.855us 193.44us cudaMemcpy2D
0.00% 85.418us 1 85.418us 85.418us 85.418us cudaGetDeviceProperties
0.00% 43.335us 3 14.445us 11.615us 17.032us cuDeviceTotalMem
0.00% 12.761us 2 6.3800us 6.3020us 6.4590us cuInit
0.00% 10.311us 5 2.0620us 1.2500us 3.3330us cuDeviceGetCount
0.00% 6.7710us 4 1.6920us 1.3540us 2.1880us cuDeviceGet
0.00% 5.7820us 3 1.9270us 1.6670us 2.2920us cuDeviceGetName
0.00% 4.6360us 2 2.3180us 1.8750us 2.7610us cuDriverGetVersion
0.00% 3.4900us 3 1.1630us 1.0940us 1.2500us cuDeviceGetUuid
0.00% 2.0320us 2 1.0160us 886ns 1.1460us cudaGetDeviceCount

Hi,
Please profile with sudo tegrastats. It shows the usage of all hardware engines:
https://docs.nvidia.com/jetson/l4t/#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/AppendixTegraStats.html

Please share information about the v4l2 source for reference:

$ v4l2-ctl -d /dev/video0 --list-formats-ext