Using VPI in GStreamer

Hi,

The above measurements are steady-state averages taken seconds after the start. I don't see any logs in the /sys/kernel/debug/ folders mentioned above; most files are 0 length with very old timestamps. I tried profiling with Nsight Systems with other accelerators trace enabled and can see some activity shown as task_submits, with channel_id and class_id numbers that do not name VIC, matching the cadence of the frames, but no duration or utilization is shown. I also tried Nsight Compute, but it does not attach to the process, maybe because gst-launch spawns sub-processes.
If cv::cuda::remap is used in the same nvsample_cudaprocess::gpu_process, it averages 2.5 ms per 1920x1080 frame, which is slower than the VPI remap performance table states but much faster than what I measured for VPI with the CUDA backend.
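For context, a comparison like that can be timed inside gpu_process along these lines (a sketch with placeholder names, not the attached code; the GpuMat source, destination and CV_32FC1 undistort maps are assumed to be prepared elsewhere):

#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/imgproc.hpp>
#include <chrono>
#include <cstdio>

// Time one cv::cuda::remap call; src/dst and the CV_32FC1 undistort maps are
// assumed to already live in GPU memory.
static void time_cuda_remap(const cv::cuda::GpuMat &src, cv::cuda::GpuMat &dst,
                            const cv::cuda::GpuMat &mapx, const cv::cuda::GpuMat &mapy)
{
    auto t0 = std::chrono::steady_clock::now();
    cv::cuda::remap(src, dst, mapx, mapy, cv::INTER_LINEAR);  // default (null) stream
    cv::cuda::Stream::Null().waitForCompletion();             // make sure the work has finished before stopping the timer
    auto t1 = std::chrono::steady_clock::now();
    std::printf("cv::cuda::remap %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}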
The attached code is just a modification of the NVIDIA sample and doesn't require a different makefile; fakesink could also be used instead of nv3dsink to simplify profiling.

Thanks.

You may not be familiar with sysfs. It is an interface between kernel space, where the drivers are, and userspace, where we are when using the kernel. It is organized as directories of pseudo-files (hence the 0 size). Each driver can expose some of its parameters and features as pseudo-files. You can read these from a shell with cat; it will give you the current value of the parameter. So:

# This will show what properties VIC driver exposes:
ls /sys/kernel/debug/vic

# This would read the current average activity of VIC.
# It may be 0 if VIC is not in use... launch a gst pipeline in another shell with nvvidconv converting and/or rescaling and you should see the value increase
cat /sys/kernel/debug/vic/actmon_avg

# This would filter from all the clocks the ones related to VIC
ls /sys/kernel/debug/bpmp/debug/clk/ | grep vic

# This would tell current VIC clock FLL rate
cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate

You can explore further. As long as you're just reading, this is harmless; if writing, be sure you understand what you're doing. Reading is enough for monitoring activity.
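If continuous monitoring is more convenient than repeated cat calls, a tiny standalone poller works too (an illustration, not part of the sample; run it as root since /sys/kernel/debug is root-only):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Print the VIC activity monitor average once per second while a pipeline
// runs in another shell.
int main()
{
    const std::string path = "/sys/kernel/debug/vic/actmon_avg";
    while (true)
    {
        std::ifstream f(path);
        std::string value;
        if (f && std::getline(f, value))
            std::cout << "actmon_avg=" << value << std::endl;
        else
            std::cerr << "cannot read " << path << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}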

Hi,

Thanks for the sysfs clarification. I was able to collect readings for remap with the VIC backend, with the Orin set to MAXN.

gst-launch-1.0 -v nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1, format=(string)NV12' ! nvivafilter cuda-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! fakesink sync=false -e

cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate
729600000

cat /sys/kernel/debug/vic/actmon_avg
max is 37784 and mean is 14698; using videotestsrc ! nvvidconv instead, the mean is 14155; with just the default pipeline (without the VPI part), the mean is 4345

... launch a gst pipeline in another shell with nvvidconv converting and/or rescaling

Is only the vpiSubmitRemap from libnvsample_cudaprocess done on VIC, while the upstream GStreamer conversion and scaling is done on the ISP?

I'm not sure what to make of these readings in terms of bottlenecks; are you able to reproduce the VIC latencies using the nvsample_cudaprocess.cu attached above?

Thanks.

Hi,

Since you are using VIC for remapping, would you mind maximizing the VIC frequency first?
You can find a script in the document below:

https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks

Hi,

After running clock.sh --max, the latency is smaller, ~14 ms (down from ~17 ms), with less variance, but this may not be related to VIC since its readings are the same:

cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate
729600000

cat /sys/kernel/debug/vic/actmon_avg
mean is 14660

Either the script doesn't change the VIC clock, or it is already at its max, or, as mentioned earlier, the slowdown might be in the image data transfer through VPI to/from VIC. This is not measurable through Nsight Systems, at least not 2022.2.3, and Nsight Compute 2022.2.0 does not connect or attach to gst-launch-1.0. Based on printf tracing it takes

for 1920x1080
vpiImageCreateWrapper 3.5ms
vpiStreamSync 4.8ms
vpiImageDestroy 1.4ms

for 3840x2160
vpiImageCreateWrapper 11ms
vpiStreamSync 15ms
vpiImageDestroy 3.4ms

So it takes about the same time to create and destroy the image wrapper (3.5 + 1.4 ms) as to execute vpiSubmitRemap (4.8 ms, accounted under vpiStreamSync). The cost is close to linear in the number of pixels; it is just odd that it takes that long.
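For reference, the sequence these numbers bracket looks roughly like this per frame (a sketch of the sample's structure, not the attached file verbatim; the VPI stream, the remap payload and the output image are assumed to be created once in init(), CHECK_VPI_STATUS is the sample's error macro, headers <vpi/Image.h>, <vpi/Stream.h> and <vpi/algo/Remap.h> are assumed included, and the interpolation/border settings are placeholders):

VPIImage in = nullptr;

// ~3.5 ms at 1080p in the trace above: wrap the incoming EGLImage.
CHECK_VPI_STATUS(vpiImageCreateWrapper(&data, nullptr, 0, &in));

// Submission returns almost immediately; the work is only queued on the stream.
CHECK_VPI_STATUS(vpiSubmitRemap(stream, VPI_BACKEND_VIC, remap, in, out,
                                VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0));

// The VIC execution is accounted for here (~4.8 ms).
CHECK_VPI_STATUS(vpiStreamSync(stream));

// ~1.4 ms: drop the wrapper again.
vpiImageDestroy(in);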
Since the wrapper creation/destruction doubles the latency and halves the bandwidth, is there a way to reuse VPIImage wrappers between calls and only change their VPIImageData pointer?

Thanks.

Has any analysis been done on why image wrapping takes as much time as the entire VIC undistort execution (and even more so relative to the CUDA backend), and is there a way to reuse the image wrapper between calls instead of creating and destroying it each time?

Hi,

Sorry for the late update.

We are still checking the VIC timing issue internally.
Will share more information with you later.

Thanks.

Hi,

We tested the source you shared on Sep 2.
There is no execution time or latency output.

Could you share the source that measures the performance (e.g., the one from Sep 15) with us?

Thanks.

Hi,

Attached is the nvsample_cudaprocess.cu (13.8 KB) for 3840x2160; time_taken is in seconds.

Thanks.

Hi,

Sorry for keeping you waiting.
Below is the result we tested on Orin + JetPack 5.0.2 with an IMX274 at 1920x1080p.

...
time_taken 0.000011 line 279
time_taken 0.003356 line 301
time_taken 0.000039 line 317
time_taken 0.003982 line 329
time_taken 0.002067 line 336
time_taken 0.000024 line 352
time_taken 0.000244 line 361
time_taken 0.000225 line 261
time_taken 0.000021 line 270
time_taken 0.000011 line 279
time_taken 0.003405 line 301
time_taken 0.000039 line 317
time_taken 0.003993 line 329
time_taken 0.002054 line 336
time_taken 0.000026 line 352
time_taken 0.000245 line 361
time_taken 0.000244 line 261
time_taken 0.000023 line 270
time_taken 0.000011 line 279
time_taken 0.003314 line 301
time_taken 0.000038 line 317
time_taken 0.004007 line 329
time_taken 0.002021 line 336
time_taken 0.000026 line 352
time_taken 0.000245 line 361
time_taken 0.000266 line 261
time_taken 0.000024 line 270
time_taken 0.000014 line 279
time_taken 0.003457 line 301
...

The remap (line 317) takes only around 0.039 ms.
Is this similar to your observation?

We boost the device with sudo ./max_clock.sh --max and use the following pipeline for benchmarking:

$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12' ! nvivafilter cuda-process=true pre-process=true post-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! nv3dsink

Thanks.

Hi,

The numbers are very similar, but 0.039 ms is just submitting the undistort to the VPI stream, not its completion, which is line 329. Creating and destroying a 1080p image wrapper takes 5.335 ms (3.314 + 2.021), more than the 4.046 ms the undistort itself takes (0.039 + 4.007). I'm not sure where the transfer of the image to VIC actually occurs, at line 301 or after 317, but the transfer of the undistorted image back occurs after 317 and before 329.

3.314ms vpiImageCreateWrapper line 301
0.039ms vpiSubmitRemap line 317
4.007ms vpiStreamSync line 329
2.021ms vpiImageDestroy line 336

Thanks.

Hi,

Sorry for missing that.
The remap time is ~4 ms, which is quite close to the VPI document's figure of 3.809±0.003 ms for 1080p with linear interpolation.

If you want to improve the performance of the wrapper, you can try replacing vpiImageCreateWrapper with vpiImageSetWrapper.
That API avoids the allocation, but the user needs to make sure the wrapped memory is not deallocated while it is still in use.

https://docs.nvidia.com/vpi/group__VPI__Image.html#ga8dd7ba9452d7a2265c09bbd73ffedcb6
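For illustration, a rough sketch of that pattern (assuming, as in the sample, data is a VPIImageData refilled from the frame's EGLImage on every call, and CHECK_VPI_STATUS is the sample's error macro; untested here):

// Keep the wrapper across calls instead of recreating it per frame.
static VPIImage s_wrapper = nullptr;

if (s_wrapper == nullptr)
{
    // First frame: pay the wrapper allocation cost once.
    CHECK_VPI_STATUS(vpiImageCreateWrapper(&data, nullptr, 0, &s_wrapper));
}
else
{
    // Later frames: only redirect the existing wrapper to the new buffer.
    // The wrapped memory must stay valid until the stream has been synced.
    CHECK_VPI_STATUS(vpiImageSetWrapper(s_wrapper, &data));
}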

Thanks.

Hi,

Thank you, vpiImageSetWrapper improves performance; lines 301-336 now take 7 ms instead of 9.3 ms. That still leaves 7 - 3.809 ≈ 3.2 ms, i.e. ~1.6 ms each way, for image transfers to/from VIC, which for 1920x1080 NV12 is ~1.9 GB/s. Does 1.9 GB/s seem high enough or expected for DDR or NVMM DMA buffers, to/from whatever memory VIC uses?
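For clarity, the ~1.9 GB/s figure follows from the NV12 frame size and the leftover per-direction time (a back-of-envelope check, not a measurement):

// 1080p NV12 is 12 bits/pixel = 1.5 bytes/pixel.
constexpr double frame_bytes = 1920.0 * 1080.0 * 1.5;                   // ~3.11 MB per frame
constexpr double per_dir_ms  = (7.0 - 3.809) / 2.0;                      // ~1.6 ms each direction
constexpr double gb_per_s    = frame_bytes / (per_dir_ms * 1e-3) / 1e9;  // ~1.95 GB/s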

Thanks.

Hi,

Thanks for the testing.

Is this slower than vpiImageCreateEGLImageWrapper on JetPack 4 / VPI 1?
If yes, could you share the performance with us so we can check it with our internal team?

Thanks.

Hi,

Sorry, I can't check the performance on the previous JetPack/VPI version. vpiImageSetWrapper is faster than vpiImageCreateEGLImageWrapper, but everything from vpiSubmitRemap until vpiStreamSync runs on a VPI stream, so it is not clear how long the individual operations take, other than that they involve a transfer NVMM->VIC, the VIC undistort, and a transfer VIC->NVMM, since the GStreamer stream is on NVMM. If VIC uses NVMM and NV12 natively, then there should be no transfer and the undistort takes closer to 7 ms, not 3.8 ms. If VIC uses its own memory or has to convert NV12, then the transfer/conversion takes ~1.6 ms each way, which seems slow for 1920x1080 at 12 bits/pixel (~1.9 GB/s), given that a single PCIe 4.0 lane is 2 GB/s.

Does VIC use NVMM or has its own memory and can use NV12 natively?

Thanks.

Hi,

VIC can use NVMM; usually it is wrapped via the NvBuffer APIs.
Thanks

Hi,

When using vpiImageCreateEGLImageWrapper or vpiImageSetWrapper, do those implicitly use NVMM in the background, or does that have to be explicitly coded in nvsample_cudaprocess.cu? If it is the former, since the GStreamer image frame is already in NVMM, that would mean the undistort itself takes 7 ms. If it is the latter, could you provide a snippet showing how to wrap an EGLImageKHR and call vpiSubmitRemap so that everything stays on NVMM?

Thanks.

Hi,

We are confirming this internally.
Will share more information with you later.

Hi,

Thanks for your patience.
We can get better performance by setting VPI Image flags.

diff --git a/nvsample_cudaprocess.cu b/nvsample_cudaprocess.cu
index 35039f7..6464c4e 100644
--- a/nvsample_cudaprocess.cu
+++ b/nvsample_cudaprocess.cu
@@ -294,7 +294,7 @@ gpu_process (EGLImageKHR image, void ** usrptr)
   // start VPI
   data.buffer.egl = image;
 
-  CHECK_VPI_STATUS(vpiImageCreateWrapper(&data, nullptr, 0, &img));
+  CHECK_VPI_STATUS(vpiImageCreateWrapper(&data, nullptr, VPI_BACKEND_VIC|VPI_RESTRICT_MEM_USAGE, &img));
   clock_gettime(CLOCK_MONOTONIC, &end);
   time_taken = (((end.tv_sec - start.tv_sec) * 1e9) + (end.tv_nsec - start.tv_nsec)) * 1e-9;
   start = end;

Previous

time_taken 0.000261 line 261
time_taken 0.000012 line 270
time_taken 0.000012 line 279
time_taken 0.003389 line 301
time_taken 0.000031 line 317
time_taken 0.003992 line 329
time_taken 0.001958 line 336
time_taken 0.000015 line 352
time_taken 0.000224 line 361

Current

time_taken 0.000263 line 261
time_taken 0.000018 line 270
time_taken 0.000009 line 279
time_taken 0.000722 line 301
time_taken 0.000021 line 317
time_taken 0.003937 line 329
time_taken 0.000558 line 336
time_taken 0.000013 line 352
time_taken 0.000209 line 361

The wrapping time (line 301) decreases from ~3.389 ms to ~0.722 ms.
The destroy time (line 336) also decreases from ~1.958 ms to ~0.558 ms.

Thanks.

Hi,

Using JetPack 5.0.2 with its libgstnvivafilter and an IMX477, I am not getting much improvement in the measurements with VPI_BACKEND_VIC | VPI_RESTRICT_MEM_USAGE in nvsample_cudaprocess.cu (13.8 KB). The overall processing time (the entire call) is still between 6-20 ms depending on boosting, so ~2 ms (~3 ms non-boosted) is still spent in image creation/wrapping and transfer.

I was not able to use vpiImageSetWrapper in conjunction with a one-time vpiImageCreateWrapper, as it often fails with
Error: VPI_ERROR_INVALID_ARGUMENT in nvsample_cudaprocess.cu at line 300 ((NvError_BadParameter) Can't convert Image to glsiImage)

For 1920x1080@15

MAXN and clock.sh --restore
time_taken 0.000331 line 261
time_taken 0.000020 line 270
time_taken 0.000017 line 279
time_taken 0.000805 line 301
time_taken 0.000025 line 317
time_taken 0.016899 line 329
time_taken 0.001385 line 336
time_taken 0.000037 line 352
time_taken 0.020015 line 361

MAXN and clock.sh --max
time_taken 0.000272 line 261
time_taken 0.000017 line 270
time_taken 0.000018 line 279
time_taken 0.000771 line 301
time_taken 0.000026 line 317
time_taken 0.003943 line 329
time_taken 0.000565 line 336
time_taken 0.000016 line 352
time_taken 0.005866 line 361

For 3840x2160@30 the overall time is between 18-27 ms; isn't it odd that creation/wrapping and transfer still take a similar ~2.5 ms (boosted) even though the image is 4x larger?

time_taken 0.001368 line 261
time_taken 0.000033 line 270
time_taken 0.000032 line 279
time_taken 0.002061 line 301
time_taken 0.000053 line 317
time_taken 0.020705 line 329
time_taken 0.001536 line 336
time_taken 0.000040 line 352
time_taken 0.026782 line 361

clock.sh --max
time_taken 0.000475 line 261
time_taken 0.000010 line 270
time_taken 0.000007 line 279
time_taken 0.000771 line 301
time_taken 0.000020 line 317
time_taken 0.015371 line 329
time_taken 0.000625 line 336
time_taken 0.000013 line 352
time_taken 0.017616 line 361

Since 18 ms is more than half the frame period and couldn't be used for 60 fps, would it be doable to put alternate frames on two VPI streams created within nvsample_cudaprocess::init?
In the case of two 30 fps cameras, i.e. two GStreamer graphs in the same process, is nvsample_cudaprocess::init called twice, overwriting its static VPI stream variables, so that the graphs have to run in separate processes?

Thanks.