Using VPI in GStreamer

Hi,

JetPack 5.0.2 installs successfully, but the nvarguscamerasrc camera source doesn’t work. The cause is not GStreamer or libgstnvarguscamerasrc.so but the camera driver. The custom IMX477 driver binaries for Jetson Linux 34.1.1 don’t work on 35.1, and even after recompiling the custom kernel from sources the camera is still not detected.

v4l2-ctl --list-devices
Cannot open device /dev/video0

ls /dev/video*
empty

dmesg | grep -E "imx477"
empty

python /opt/nvidia/jetson-io/jetson-io.py
only shows IMX274, IMX185 and IMX390; IMX477 is not listed. With the official 5.0.2/35.1 release I am not sure why it doesn’t show up, since there is a native driver. Even if it did, I am not sure it would work, since the LI-JXAV-MIPI-ADPT-4CAM adapter sits in between.

Regarding VPI 2, once you have the EGLImageKHR to VPIImageData snippet/sample, I could try to find a different camera.

Thanks.

Hi,

Thanks for the testing.
We will help to check the IMX477 issue and get back to you.

@predrag12
The default native Orin device tree doesn’t support IMX477.
I would suggest consulting Leopard to get the device tree and driver for Orin r35.1.

Thanks

Hi,

According to Configuring the CSI Connector, on other Jetsons the IMX477 can be set as the default camera with /opt/nvidia/jetson-io/jetson-io.py, and the kernel sources contain nv_imx477.c and imx477_mode_tbls.h, so I assumed there was native driver support on Orin.
The custom LI device tree and driver for IMX477 work on Jetson Linux 34.1.1 (sans nvivafilter) but not on 35.1, so the long-term solution is to wait for a driver update. Meanwhile, a short/mid-term recourse might be to use one of the other cameras with a native Orin driver over the same LI adapter. Would the Nvidia IMX390 driver work, or is it meant for some other adapter?

Thanks.

Hi,

Sorry for the late reply.

Would you mind filing another topic about the camera support of Orin?
Let us focus on the GStreamer + VPI in this topic.

We solved the nvivafilter issue in r35.1.
Please replace the libgstnvivafilter.so with this attachment (17.4 KB).

Then please update the Makefile for Orin GPU architecture.

diff --git a/Makefile b/Makefile
index e6b05fd..043a943 100644
--- a/Makefile
+++ b/Makefile
@@ -111,15 +111,11 @@ LIBRARIES += -L$(TEGRA_LIB_DIR) -lcuda -lrt
 ifneq ($(OS_ARCH),armv7l)
 GENCODE_SM10    := -gencode arch=compute_10,code=sm_10
 endif
-GENCODE_SM30    := -gencode arch=compute_30,code=sm_30
-GENCODE_SM32    := -gencode arch=compute_32,code=sm_32
-GENCODE_SM35    := -gencode arch=compute_35,code=sm_35
-GENCODE_SM50    := -gencode arch=compute_50,code=sm_50
-GENCODE_SMXX    := -gencode arch=compute_50,code=compute_50
 GENCODE_SM53    := -gencode arch=compute_53,code=sm_53
 GENCODE_SM62    := -gencode arch=compute_62,code=sm_62
 GENCODE_SM72    := -gencode arch=compute_72,code=sm_72
-GENCODE_SM_PTX  := -gencode arch=compute_72,code=compute_72
+GENCODE_SM87    := -gencode arch=compute_87,code=sm_87
+GENCODE_SM_PTX  := -gencode arch=compute_87,code=compute_87
 ifeq ($(OS_ARCH),armv7l)
 GENCODE_FLAGS   ?= $(GENCODE_SM32)
 else

Then you should be able to run it with the following pipeline: (testing with IMX274)

$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12' ! nvivafilter cuda-process=true pre-process=true post-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)RGBA' ! nv3dsink

We have confirmed the gpu_process will be executed normally.

Next, we are going to make a VPI sample based on nvsample_cudaprocess.
Will let you know once it is completed.

Thanks.

Hi,

Attached is a sample for nvsample_cudaprocess with VPI for your reference.
0001-Add-VPI-support.patch (5.0 KB)

Thanks.

Hi,

We can focus on GStreamer and VPI, but r35.1 has a dependency on the CSI camera (driver), so I cannot run gst-launch-1.0 nvarguscamerasrc in exactly the same way using a USB camera. Alternatively, would the fixed libgstnvivafilter.so work on 34.1.1, since GStreamer didn’t change, or is the IMX274 driver compatible with the LI board?

Thank you for the sample. Was not able to run with

gst-launch-1.0 -v v4l2src device=/dev/video0 ! nvvidconv ! 'video/x-raw(memory:NVMM), width=(int)1280, height=(int)960, framerate=(fraction)30/1, format=(string)NV12' ! nvivafilter cuda-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)RGBA' ! nv3dsink -e

Invalid eglcolorformat 7
Error: VPI_ERROR_INVALID_IMAGE_FORMAT in nvsample_cudaprocess.cu at line 293 (CUDA: Conversion not implemented between VPIImageFormat(VPI_COLOR_MODEL_RGB,VPI_COLOR_SPEC_UNDEFINED,VPI_MEM_LAYOUT_PITCH_LINEAR,VPI_DATA_TYPE_UNSIGNED,WZYX,X8_Y8_Z8_W8) and VPI_IMAGE_FORMAT_RGBA8)
Error: VPI_ERROR_INVALID_ARGUMENT in nvsample_cudaprocess.cu at line 294 (Input and output images must have the same format)
Error: VPI_ERROR_INVALID_IMAGE_FORMAT in nvsample_cudaprocess.cu at line 297 (CUDA: Conversion not implemented between VPI_IMAGE_FORMAT_RGBA8 and VPIImageFormat(VPI_COLOR_MODEL_RGB,VPI_COLOR_SPEC_UNDEFINED,VPI_MEM_LAYOUT_PITCH_LINEAR,VPI_DATA_TYPE_UNSIGNED,WZYX,X8_Y8_Z8_W8))

If line 275 is changed to VPI_IMAGE_FORMAT_NV12, which would avoid an unnecessary NV12->RGBA->NV12 round trip for downstream, then

gst-launch-1.0 -v v4l2src device=/dev/video0 ! nvvidconv ! 'video/x-raw(memory:NVMM), width=(int)1280, height=(int)960, framerate=(fraction)30/1, format=(string)NV12' ! nvivafilter cuda-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! nv3dsink -e

Error: VPI_ERROR_INVALID_IMAGE_FORMAT in nvsample_cudaprocess.cu at line 294 (Only image formats with full range are accepted, not VPI_IMAGE_FORMAT_NV12)

If lines 291-297 are changed to VPI_BACKEND_VIC and line 275 to VPI_IMAGE_FORMAT_NV12, then it works, but slowly. I modified nvsample_cudaprocess.cu (10.6 KB) to move the one-time allocations outside gpu_process, but even then processing takes 20ms per 1280x960 frame, which is too slow to justify offloading dewarping to VIC. Shouldn’t it be around 3ms per 1920x1080 frame according to VPI - Vision Programming Interface: Remap?

Thanks.

Hi,

Thanks for the feedback.

Does the default pipeline (without the VPI part) work in your environment?

$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12' ! nvivafilter cuda-process=true pre-process

There are some issues in the nvivafilter library, so you will need to replace it with the library shared above.
We tested this on r35.1 and are not sure whether the library works on r34.

Thanks.

Hi,

The default pipeline with GStreamer nvarguscamerasrc unfortunately does not work, since the driver doesn’t work on r35.1. Hence I am curious how the IMX274 works, since it still needs some adapter/driver for J509.

After reverting to 34.1.1 and replacing libgstnvivafilter.so with the one from r35.1, both the CUDA and VIC backends work, using remap instead of perspective warp: nvsample_cudaprocess.cu (11.7 KB).

gst-launch-1.0 -v nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1, format=(string)NV12' ! nvivafilter cuda-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! nv3dsink -e

However, processing is still very slow: ~17ms per 1920x1080 frame with the VIC backend and ~14ms with the CUDA backend, even with a simpler interpolation type or a lower resolution. Compared with the expected figures at https://docs.nvidia.com/vpi/algo_remap.html#algo_remap_perf and with the observed cv::cuda::remap, there appears to be some unaccounted overhead that is much bigger than the processing itself (3ms or 0.4ms). Since VPI operations are issued as stream transactions towards VIC, I am not sure how to profile the bottleneck.

Thanks.

I’m currently unable to try your case, but be sure to measure the period after initialization has finished. You may discard the measurements from the first frames (say 20 frames) and average over a minimum of 100 frames.
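
The advice above can be sketched in a few lines of Python. This is only an illustration: the function name steady_state_mean and the sample data are made up, and the defaults of 20 warm-up frames and 100 measured frames are simply the numbers suggested above.

```python
from statistics import mean


def steady_state_mean(frame_times_ms, warmup=20, min_frames=100):
    """Average per-frame times after dropping the first `warmup` samples.

    Raises ValueError when fewer than `min_frames` samples remain, so a
    short run is not mistaken for a steady-state measurement.
    """
    steady = list(frame_times_ms)[warmup:]
    if len(steady) < min_frames:
        raise ValueError(
            f"need at least {min_frames} steady-state frames, got {len(steady)}"
        )
    return mean(steady)


# Made-up data: 20 slow start-up frames followed by 100 steady frames.
samples = [50.0] * 20 + [14.0] * 100
print(steady_state_mean(samples))
```

Including the start-up frames in the same data would pull the average up to 20ms, which is why the warm-up samples are dropped before averaging.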

For monitoring VIC activity, you may have a look at sysfs:

sudo su

ls /sys/kernel/debug/vic
cat /sys/kernel/debug/vic/actmon_avg

ls /sys/kernel/debug/bpmp/debug/clk/ | grep vic
cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate

exit

Hi,

The measurements above are steady-state averages, taken seconds after the start. I don’t see any logs in the /sys/kernel/debug/ folders above; most files are 0 length with very old timestamps. I tried profiling with Nsight Systems with the other accelerators trace enabled, and can see some activity shown as task_submits with channel_id and class_id numbers (not naming VIC) that matches the cadence of frames, but it doesn’t show any duration or utilization. I also tried Nsight Compute, but it doesn’t connect to the process, maybe because gst-launch spawns other sub-processes.
If cv::cuda::remap is used under the same nvsample_cudaprocess::gpu_process, it averages 2.5ms per 1920x1080 frame, which is slower than what the VPI remap performance table states but much faster than the measured VPI with the CUDA backend.
The code attached above is just a modification of the Nvidia sample and doesn’t require a different Makefile. fakesink could also be used instead of nv3dsink to simplify profiling.

Thanks.

You’re probably not familiar with sysfs. It is an interface between kernel space, where the drivers live, and userspace, where we are when using the kernel. It is organized as directories and pseudo-files (thus size 0). Each driver can expose some parameters and features as pseudo-files. You can read these from the shell with cat, which gives you the current value of the parameter. So:

# This will show what properties VIC driver exposes:
ls /sys/kernel/debug/vic

# This would read current average activity of VIC. 
# May be 0 if not used... launch a gst pipeline in another shell with nvvidconv converting and/or rescaling, you should see an increased value
cat /sys/kernel/debug/vic/actmon_avg

# This would filter from all the clocks the ones related to VIC
ls /sys/kernel/debug/bpmp/debug/clk/ | grep vic

# This would tell current VIC clock FLL rate
cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate

Feel free to explore. As long as you’re just reading, this should be harmless. If writing, be sure you understand what you’re doing. Reading is enough for monitoring activity.
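
Since a single cat only gives one instantaneous value, a small polling loop can help when comparing pipelines. The following is a minimal sketch, assuming the pseudo-file contains a single integer; sample_counter is a hypothetical helper name, and the path is the one listed above (reading it needs root).

```python
import time
from pathlib import Path


def sample_counter(path, samples=10, interval_s=0.1):
    """Read an integer pseudo-file `samples` times and return (mean, max)."""
    values = []
    for _ in range(samples):
        values.append(int(Path(path).read_text().strip()))
        time.sleep(interval_s)
    return sum(values) / len(values), max(values)


# Path taken from the commands above; run as root while a pipeline is active.
path = "/sys/kernel/debug/vic/actmon_avg"
try:
    avg, peak = sample_counter(path)
    print(f"mean is {avg:.0f}, max is {peak}")
except (OSError, ValueError) as err:
    # Missing file, missing permissions, or content that is not an integer.
    print(f"could not read {path}: {err}")
```

Running it once with the pipeline idle and once with it active gives two means that can be compared directly.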

Hi,

Thanks for the sysfs clarification. Was able to collect readings for remap with VIC backend, Orin set to MAXN.

gst-launch-1.0 -v nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=(int)1920, height=(int)1080, framerate=(fraction)30/1, format=(string)NV12' ! nvivafilter cuda-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! fakesink sync=false -e

cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate
729600000

cat /sys/kernel/debug/vic/actmon_avg
max is 37784, mean is 14698; using videotestsrc ! nvvidconv instead, mean is 14155; using just the default pipeline (without the VPI part), mean is 4345

... launch a gst pipeline in another shell with nvvidconv converting and/or rescaling

Is only the vpiSubmitRemap call in libnvsample_cudaprocess done on VIC, with the upstream GStreamer conversion and scaling done on the ISP?

I am not sure what to make of these readings in terms of bottlenecks. Are you able to reproduce the VIC latencies using the nvsample_cudaprocess.cu attached above?

Thanks.

Hi,

Since you are using VIC for remapping, would you mind maximizing the VIC frequency first?
You can find a script in the below document:

https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks

Hi,

After running clock.sh --max, latency is lower, ~14ms (from ~17ms), with less variance, but that may not be related to VIC since its measurements are the same:

cat /sys/kernel/debug/bpmp/debug/clk/nafll_vic/rate
729600000

cat /sys/kernel/debug/vic/actmon_avg
mean is 14660

Either the script doesn’t change the VIC clock, or the clock is already at its max, or, as mentioned earlier, the slowdown might be in the image data transition through VPI to/from VIC. This is not measurable through Nsight Systems, at least not 2022.2.3, and Nsight Compute 2022.2.0 does not connect or attach to gst-launch-1.0. Based on printf tracing, it takes:

for 1920x1080
vpiImageCreateWrapper 3.5ms
vpiStreamSync 4.8ms
vpiImageDestroy 1.4ms

for 3840x2160
vpiImageCreateWrapper 11ms
vpiStreamSync 15ms
vpiImageDestroy 3.4ms

So it takes about the same time to create and destroy the image wrappers (3.5+1.4) as to execute vpiSubmitRemap (4.8, accounted under vpiStreamSync). It is close to linear with the number of pixels; it is just odd that it takes that long.
Since that doubles the latency and halves the bandwidth, is there a way to reuse VPIImage wrappers between calls and only change the VPIImageData pointer?
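
The arithmetic behind that comparison can be checked with a short Python sketch. The timings are copied from the printf tracing above; the helper names wrapper_share and scaling are made up for illustration.

```python
# Per-call timings in milliseconds, copied from the printf tracing above.
timings_ms = {
    (1920, 1080): {
        "vpiImageCreateWrapper": 3.5,
        "vpiStreamSync": 4.8,
        "vpiImageDestroy": 1.4,
    },
    (3840, 2160): {
        "vpiImageCreateWrapper": 11.0,
        "vpiStreamSync": 15.0,
        "vpiImageDestroy": 3.4,
    },
}


def wrapper_share(calls):
    """Fraction of the traced time spent creating and destroying wrappers."""
    wrap = calls["vpiImageCreateWrapper"] + calls["vpiImageDestroy"]
    return wrap / sum(calls.values())


def scaling(small, large):
    """Pixel-count ratio and per-call time ratios between two resolutions."""
    pixel_ratio = (large[0] * large[1]) / (small[0] * small[1])
    ratios = {
        name: timings_ms[large][name] / timings_ms[small][name]
        for name in timings_ms[small]
    }
    return pixel_ratio, ratios


for resolution, calls in timings_ms.items():
    print(resolution, f"wrapper share {wrapper_share(calls):.0%}")

pixel_ratio, ratios = scaling((1920, 1080), (3840, 2160))
print(f"pixels x{pixel_ratio:.1f}")
for name, ratio in ratios.items():
    print(f"{name} x{ratio:.2f}")
```

With these numbers the wrapper calls account for roughly half of the traced time at both resolutions, and each call grows by a factor of about 2.4 to 3.1 for four times the pixels.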

Thanks.

Has any analysis been done as to why image wrapping takes as much time as the entire VIC undistort execution, and even more relative to the CUDA backend? And is there a way to reuse the image wrapper between calls instead of creating and destroying it each time?

Hi,

Sorry for the late update.

We are still checking the VIC timing issue internally.
Will share more information with you later.

Thanks.

Hi,

We tested the source you shared on Sep 2.
It produces no execution time or latency output.

Could you share the source that measures the performance (e.g. Sep 15) with us?

Thanks.

Hi,

Attached is the nvsample_cudaprocess.cu (13.8 KB) for 3840x2160, time_taken is in seconds.

Thanks.

Hi,

Sorry for keeping you waiting.
Below is the result of our test on Orin + JetPack 5.0.2 with an IMX274 at 1920x1080.

...
time_taken 0.000011 line 279
time_taken 0.003356 line 301
time_taken 0.000039 line 317
time_taken 0.003982 line 329
time_taken 0.002067 line 336
time_taken 0.000024 line 352
time_taken 0.000244 line 361
time_taken 0.000225 line 261
time_taken 0.000021 line 270
time_taken 0.000011 line 279
time_taken 0.003405 line 301
time_taken 0.000039 line 317
time_taken 0.003993 line 329
time_taken 0.002054 line 336
time_taken 0.000026 line 352
time_taken 0.000245 line 361
time_taken 0.000244 line 261
time_taken 0.000023 line 270
time_taken 0.000011 line 279
time_taken 0.003314 line 301
time_taken 0.000038 line 317
time_taken 0.004007 line 329
time_taken 0.002021 line 336
time_taken 0.000026 line 352
time_taken 0.000245 line 361
time_taken 0.000266 line 261
time_taken 0.000024 line 270
time_taken 0.000014 line 279
time_taken 0.003457 line 301
...

The remap (line 317) takes only around 0.039 ms.
Is this similar to your observation?
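
To put the line 317 figure in context, the log can be grouped per source line and summed over one frame. The Python sketch below uses one complete cycle (lines 261 to 361) copied from the log above; the helper name mean_ms_by_line is made up, and which call sits on each source line is not stated in the thread.

```python
import re
from collections import defaultdict
from statistics import mean

# One complete frame cycle copied from the log above (time_taken in seconds).
LOG = """\
time_taken 0.000225 line 261
time_taken 0.000021 line 270
time_taken 0.000011 line 279
time_taken 0.003405 line 301
time_taken 0.000039 line 317
time_taken 0.003993 line 329
time_taken 0.002054 line 336
time_taken 0.000026 line 352
time_taken 0.000245 line 361
"""


def mean_ms_by_line(log_text):
    """Map each source line number to its mean time_taken in milliseconds."""
    by_line = defaultdict(list)
    pattern = r"time_taken (\d+\.\d+) line (\d+)"
    for seconds, line in re.findall(pattern, log_text):
        by_line[int(line)].append(float(seconds) * 1000.0)
    return {line: mean(values) for line, values in sorted(by_line.items())}


per_line = mean_ms_by_line(LOG)
for line, ms in per_line.items():
    print(f"line {line}: {ms:.3f} ms")
print(f"sum over one frame: {sum(per_line.values()):.3f} ms")
```

For this cycle the sum is about 10 ms per frame, and lines 301, 329 and 336 together contribute about 9.5 ms of it, so most of the per-frame time is spent outside line 317.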

We boosted the device with sudo ./max_clock.sh --max and used the following pipeline for benchmarking:

$ gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12' ! nvivafilter cuda-process=true pre-process=true post-process=true customer-lib-name="libnvsample_cudaprocess.so" ! 'video/x-raw(memory:NVMM), format=(string)NV12' ! nv3dsink

Thanks.