Best remap implementation on Jetson Nano

Hello,

I need to undistort images as efficiently as possible, at 30 fps. Since the distortion is constant across frames, I precomputed a map, so that the destination image dst is computed from the source image src as dst[x,y] = src[map[x,y]] (in pseudo-code).
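To make the pseudo-code concrete, here is a minimal CPU reference in plain Python. It is only a sketch for checking a GPU implementation against, not the kernel discussed in this thread. The function name `remap_nearest`, the nearest-neighbour rounding, and the `fill` value for out-of-range coordinates are illustrative assumptions.

```python
def remap_nearest(src, map_xy, fill=0):
    """Reference remap: dst[y][x] = src[sy][sx], with (sx, sy) = map_xy[y][x].

    src    : list of rows (any pixel type)
    map_xy : list of rows of (sx, sy) source coordinates, one per dst pixel
    fill   : value used when the source coordinate falls outside src (assumed policy)
    """
    height, width = len(src), len(src[0])
    dst = []
    for map_row in map_xy:
        row = []
        for sx, sy in map_row:
            # Nearest-neighbour sampling; a real implementation would usually interpolate.
            xi, yi = int(round(sx)), int(round(sy))
            if 0 <= xi < width and 0 <= yi < height:
                row.append(src[yi][xi])
            else:
                row.append(fill)
        dst.append(row)
    return dst


# Toy usage: a map that mirrors a 3x2 image horizontally.
src = [[1, 2, 3],
       [4, 5, 6]]
flip_map = [[(2 - x, y) for x in range(3)] for y in range(2)]
flipped = remap_nearest(src, flip_map)
print(flipped)
```

Because the map is fixed, it only has to be built once; the per-frame cost is the gather itself, which is why memory access patterns dominate.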

I wrote my own CUDA kernel to do this, but it is slower than I expected.
On the other hand, I was told that OpenCV is too slow on Jetson Nano. From what I read in some forums, it looks like a remapping function should use Texture Mapping Units, which I’m not familiar with.
What is the most efficient way to implement such a mapping function?
Thanks!

Hi,

Have you checked our VPI library?

https://docs.nvidia.com/vpi/1.2/sample_fisheye.html

Thanks.

Hi AastaLLL,

I confess I wasn’t aware it was available on Jetson Nano. Can it be safely used with GStreamer?

Thanks!

Hi,

Yes, below are some related samples for your reference:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#VPI

Thanks

Hi,

Thanks for the link. Unfortunately, it is still slower than my own implementation: 30 ms/frame to remap a ~14 MPixel RGBA8 image.
I was hoping for a big improvement after looking at the benchmarks on Jetson Orin, but I’m just running on a Jetson Nano… :-(

According to that benchmark, I could expect some improvement if I changed my image format to nv12_er, but I am having trouble redesigning my GStreamer plugin to do that (I’m not sure the nv12_er VPI format actually corresponds to GStreamer’s NV12 format).
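As a rough way to see why the pixel format might matter, the sketch below compares bytes per frame for RGBA8 (32 bits per pixel) and NV12 (12 bits per pixel: a full-resolution Y plane plus a half-resolution interleaved UV plane). It uses the 6048x2280 frame size quoted later in this thread. It says nothing about whether VPI's nv12_er matches GStreamer's NV12, and a smaller frame is no guarantee of a proportionally faster remap.

```python
WIDTH, HEIGHT = 6048, 2280  # frame size quoted later in this thread


def frame_bytes(width, height, bits_per_pixel):
    """Size of one tightly packed frame in bytes (no row padding assumed)."""
    return width * height * bits_per_pixel // 8


rgba_bytes = frame_bytes(WIDTH, HEIGHT, 32)  # 4 bytes per pixel
nv12_bytes = frame_bytes(WIDTH, HEIGHT, 12)  # 1.5 bytes per pixel on average

print(f"RGBA8: {rgba_bytes / 1e6:.1f} MB per frame")
print(f"NV12 : {nv12_bytes / 1e6:.1f} MB per frame")
print(f"NV12 / RGBA8 ratio: {nv12_bytes / rgba_bytes:.3f}")
```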

Let me try with OpenCV first…

Hi,

When you test VPI, could you help us check the GPU utilization as well?

$ sudo tegrastats

If the GPU is fully utilized, you have probably already reached the limit of the Jetson Nano.

Thanks.

Hi AastaLLL,

Thanks for following up.
To follow up on my previous message, OpenCV with CUDA was actually slower (34.7 ms/frame).

Note that, with any implementation, the GPU is not 100% utilized when the algorithm runs in my GStreamer pipeline (even though the cameras produce frames faster than the pipeline can process them).

That said, here is an extract of the tegrastats output:

RAM 3257/3956MB (lfb 42x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [27%@1036,31%@1036,24%@1036,29%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 0%@921 APE 25 PLL@43C CPU@46C PMIC@50C GPU@43.5C AO@53C thermal@44.5C POM_5V_IN 5515/6132 POM_5V_GPU 1465/1821 POM_5V_CPU 713/862
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [36%@1326,40%@1326,30%@1326,26%@1326] EMC_FREQ 36%@1600 GR3D_FREQ 16%@921 APE 25 PLL@43.5C CPU@46C PMIC@50C GPU@43.5C AO@53C thermal@44.5C POM_5V_IN 6455/6149 POM_5V_GPU 1928/1826 POM_5V_CPU 944/867
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1326,27%@1326,28%@1326,22%@1326] EMC_FREQ 37%@1600 GR3D_FREQ 70%@921 APE 25 PLL@43.5C CPU@45.5C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 6494/6166 POM_5V_GPU 2247/1847 POM_5V_CPU 865/867
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1036,26%@1036,23%@1036,25%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 20%@921 APE 25 PLL@43.5C CPU@45.5C PMIC@50C GPU@43.5C AO@53.5C thermal@44.5C POM_5V_IN 6081/6162 POM_5V_GPU 1658/1838 POM_5V_CPU 868/867
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [35%@1326,32%@1326,23%@1326,25%@1326] EMC_FREQ 37%@1600 GR3D_FREQ 83%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44C AO@53.5C thermal@44.5C POM_5V_IN 6533/6179 POM_5V_GPU 2125/1851 POM_5V_CPU 1062/875
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [44%@1326,33%@1479,23%@1479,24%@1479] EMC_FREQ 37%@1600 GR3D_FREQ 91%@921 APE 25 PLL@43.5C CPU@46.5C PMIC@50C GPU@44C AO@54C thermal@45C POM_5V_IN 6983/6214 POM_5V_GPU 2393/1875 POM_5V_CPU 1176/889
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [29%@1224,30%@1224,20%@1224,18%@1224] EMC_FREQ 36%@1600 GR3D_FREQ 0%@921 APE 25 PLL@43.5C CPU@46C PMIC@50C GPU@44C AO@53C thermal@45C POM_5V_IN 5744/6195 POM_5V_GPU 1426/1856 POM_5V_CPU 751/883
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1428,25%@1428,29%@1428,24%@1428] EMC_FREQ 36%@1600 GR3D_FREQ 55%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 6533/6208 POM_5V_GPU 2046/1864 POM_5V_CPU 903/884
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [30%@1036,31%@1036,30%@921,27%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 70%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44.5C AO@53.5C thermal@44.75C POM_5V_IN 6199/6208 POM_5V_GPU 1776/1860 POM_5V_CPU 908/884
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@921,24%@921,28%@921,31%@921] EMC_FREQ 36%@1600 GR3D_FREQ 46%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 5665/6188 POM_5V_GPU 1545/1849 POM_5V_CPU 713/878
RAM 3260/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1036,19%@1036,33%@1036,26%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 82%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@54C thermal@45.25C POM_5V_IN 6120/6185 POM_5V_GPU 2132/1859 POM_5V_CPU 710/872
RAM 3261/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [36%@1132,27%@1132,32%@1132,27%@1132] EMC_FREQ 36%@1600 GR3D_FREQ 45%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44.5C AO@53.5C thermal@45.5C POM_5V_IN 5853/6174 POM_5V_GPU 1779/1856 POM_5V_CPU 791/869
RAM 3262/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [29%@1428,27%@1428,35%@1428,27%@1428] EMC_FREQ 36%@1600 GR3D_FREQ 81%@921 APE 25 PLL@44.5C CPU@46.5C PMIC@50C GPU@43.5C AO@54C thermal@45C POM_5V_IN 6680/6191 POM_5V_GPU 1964/1859 POM_5V_CPU 1100/877
RAM 3263/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1036,32%@1036,46%@1036,26%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 50%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@53C thermal@45C POM_5V_IN 5972/6183 POM_5V_GPU 1582/1850 POM_5V_CPU 908/878
RAM 3265/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [28%@1479,31%@1479,36%@1479,29%@1479] EMC_FREQ 37%@1600 GR3D_FREQ 97%@921 APE 25 PLL@44C CPU@47C PMIC@50C GPU@44C AO@54C thermal@45.25C POM_5V_IN 6455/6192 POM_5V_GPU 2046/1857 POM_5V_CPU 1141/886
RAM 3268/3956MB (lfb 39x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@921,28%@921,29%@921,27%@921] EMC_FREQ 37%@1600 GR3D_FREQ 4%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44.5C AO@53.5C thermal@45.5C POM_5V_IN 5814/6180 POM_5V_GPU 1582/1848 POM_5V_CPU 751/882
RAM 3270/3956MB (lfb 38x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1326,27%@1326,25%@1326,26%@1326] EMC_FREQ 36%@1600 GR3D_FREQ 68%@921 APE 25 PLL@44.5C CPU@46.5C PMIC@50C GPU@44.5C AO@54C thermal@45.25C POM_5V_IN 6386/6186 POM_5V_GPU 1695/1844 POM_5V_CPU 865/882
RAM 3273/3956MB (lfb 37x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@921,35%@921,39%@921,28%@921] EMC_FREQ 36%@1600 GR3D_FREQ 20%@921 APE 25 PLL@44.5C CPU@47C PMIC@50C GPU@44.5C AO@54C thermal@45.5C POM_5V_IN 5962/6180 POM_5V_GPU 1737/1841 POM_5V_CPU 829/880

As you can see, the actual GPU use fluctuates a lot. It is not the case in my custom test app, but unfortunately, it has to run in the GStreamer context…
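To put a number on that fluctuation, a small script can pull the GR3D_FREQ percentage out of each tegrastats line. This is only a sketch: the regular expression assumes the `GR3D_FREQ <load>%@<MHz>` layout seen in the log above, and the sample lines are trimmed copies of the first three entries.

```python
import re

# Matches the GPU load field as it appears in the log above, e.g. "GR3D_FREQ 70%@921".
GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%")


def gpu_load_samples(log_text):
    """Return the GPU load percentages found in a tegrastats log, in order."""
    return [int(value) for value in GR3D_PATTERN.findall(log_text)]


def summarize(samples):
    """Min, max and mean of a non-empty list of load samples."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }


# Trimmed copies of the first three lines of the log above.
sample_log = """\
EMC_FREQ 36%@1600 GR3D_FREQ 0%@921 APE 25
EMC_FREQ 36%@1600 GR3D_FREQ 16%@921 APE 25
EMC_FREQ 37%@1600 GR3D_FREQ 70%@921 APE 25
"""

samples = gpu_load_samples(sample_log)
summary = summarize(samples)
print(samples, summary)
```

Applied to all 18 lines of the extract, the samples range from 0% to 97% with a mean of roughly 50%, which matches the impression that the GPU is idle for a good part of each frame period.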

Hi,

It looks like there is still some headroom in GPU utilization.

From the tegrastats log, the memory consumption is pretty high.
If you want to investigate further, could you profile the pipeline with Nsight Systems and share the output with us?

Thanks.

Hi AastaLLL,

Thanks for your follow-up.
Note that I could not find an Nsight Systems version that runs remotely on a PC, gives a visual output, and supports my OS version (L4T 32.7.1).

Hence, here is nvprof’s text output. I interrupted the pipeline after processing 261 frames:

==24942== Profiling application: gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1 ! nvvidconv flip-method=2 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA,width=3024,height=2280 ! mix. nvarguscamerasrc sensor-
==24942== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:   75.90%  6.48994s       524  12.385ms  1.1864ms  33.381ms  cudaGraphicsUnregisterResource
                   17.99%  1.53827s       523  2.9412ms  1.3921ms  60.936ms  cudaGraphicsEGLRegisterImage
                    4.79%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel
                    0.36%  30.852ms       261  118.21us  71.876us  896.01us  cudaCreateTextureObject
                    0.34%  28.902ms       524  55.155us  35.938us  411.41us  cudaStreamSynchronize
                    0.15%  13.133ms       261  50.318us  15.208us  787.41us  cudaDestroyTextureObject
                    0.14%  11.877ms         1  11.877ms  11.877ms  11.877ms  cudaFree
                    0.10%  8.8913ms       260  34.197us  14.792us  965.64us  cudaProfilerStart
                    0.10%  8.3222ms       524  15.882us  4.9480us  338.18us  cudaGraphicsResourceGetMappedEglFrame
                    0.06%  5.0495ms      3217  1.5690us     469ns  120.52us  cudaGetLastError
                    0.06%  4.7432ms       524  9.0510us  3.9590us  71.251us  cudaPointerGetAttributes
                    0.01%  457.20us       261  1.7510us     781ns  107.45us  cudaCreateChannelDesc
                    0.00%  286.62us        65  4.4090us  2.9170us  57.918us  cudaEventDestroy
                    0.00%  79.429us         1  79.429us  79.429us  79.429us  cudaGetDeviceProperties
                    0.00%  33.595us         2  16.797us  15.157us  18.438us  cudaStreamDestroy
                    0.00%  4.4270us         1  4.4270us  4.4270us  4.4270us  cuDeviceGetCount
                    0.00%  1.0940us         1  1.0940us  1.0940us  1.0940us  cudaGetDeviceCount

==24942== NVTX result:
==24942==   Thread "<unnamed>" (id = 637055472)
==24942==     Domain "VPI"
==24942==       Range "sync cuda"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  17.132ms       261  65.639us  42.032us  300.68us  sync cuda
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiImageCreateEGLImageWrapper"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  74.907ms         2  37.453ms  37.176ms  37.731ms  vpiImageCreateEGLImageWrapper
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiImageSetWrappedEGLImage"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  170.28ms       520  327.45us  11.719us  12.747ms  vpiImageSetWrappedEGLImage
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiStreamSync"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  8.68383s       261  33.271ms  26.482ms  403.68ms  vpiStreamSync
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiSubmitRemap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  17.884ms       261  68.521us  11.772us  11.511ms  vpiSubmitRemap
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==   Thread "<unnamed>" (id = 671085040)
==24942==     Domain "VPI"
==24942==       Range "Remap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  8.54852s       261  32.753ms  26.351ms  399.99ms  Remap
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel

==24942==       Range "dispatch"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  474.68ms       261  1.8187ms  218.49us  361.71ms  dispatch
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel

==24942==       Range "map"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  1.53948s       261  5.8984ms  3.1450ms  96.829ms   map
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "unmap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.52813s       261  25.012ms  22.729ms  34.775ms  unmap
No kernels were profiled in this range.
No API activities were profiled in this range.

======== Error: Application returned non-zero code 2
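To read the profile against the 30 fps target, it helps to turn the totals into per-frame figures. The sketch below divides a few totals copied from the output above by the 261 kernel launches. Treating each total as "per processed frame" is an assumption, since the register/unregister calls are issued about twice per frame.

```python
FRAMES = 261                   # number of gpuGeoTrans launches in the profile above
FRAME_BUDGET_MS = 1000.0 / 30  # time available per frame at 30 fps

# Totals in seconds, copied from the nvprof output above.
totals_s = {
    "gpuGeoTrans kernel": 5.74110,
    "cudaGraphicsUnregisterResource": 6.48994,
    "cudaGraphicsEGLRegisterImage": 1.53827,
    "vpiStreamSync": 8.68383,
}


def per_frame_ms(total_s, frames=FRAMES):
    """Average cost per processed frame, in milliseconds."""
    return total_s * 1000.0 / frames


per_frame = {name: per_frame_ms(total) for name, total in totals_s.items()}

print(f"budget at 30 fps: {FRAME_BUDGET_MS:.2f} ms")
for name, ms in per_frame.items():
    print(f"{name:32s} {ms:7.3f} ms/frame ({100 * ms / FRAME_BUDGET_MS:5.1f}% of budget)")
```

The kernel alone takes about 22 ms of the 33.3 ms budget, and vpiStreamSync averages about 33.3 ms, so the remap stage sits right at the limit. The roughly 25 ms per frame in cudaGraphicsUnregisterResource is close to the kernel time plus a few milliseconds; one possible reading, not confirmed by this profile, is that the call mostly waits for the kernel to finish rather than adding independent work.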

Note that the input and output frames are 6048*2280 px in RGBA, and that the map is quite “distorted” in this case. I noticed that the processing time drops to about 25 ms with an identity map (probably because the memory accesses are more cache-friendly), but of course the remap operation is useless in that case.
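For a rough sense of how close this is to a memory-bound limit, the sketch below estimates the traffic of one remap and the bandwidth implied by the measured kernel time. It rests on several assumptions: each source and destination pixel is moved exactly once as 4-byte RGBA, the map is dense with one float2 (8 bytes) per output pixel as the kernel signature suggests, and the 25.6 GB/s peak is the Jetson Nano spec-sheet figure rather than anything measured here.

```python
WIDTH, HEIGHT = 6048, 2280  # frame size stated above
KERNEL_MS = 21.997          # average gpuGeoTrans time from the nvprof output
PEAK_GBPS = 25.6            # Jetson Nano spec-sheet memory bandwidth (assumption)

# Assumed bytes touched per output pixel.
READ_SRC = 4   # one uchar4 source sample (ignores extra interpolation taps)
WRITE_DST = 4  # one uchar4 destination pixel
READ_MAP = 8   # one float2 map entry (assumes a dense map)

pixels = WIDTH * HEIGHT
total_bytes = pixels * (READ_SRC + WRITE_DST + READ_MAP)
effective_gbps = total_bytes / (KERNEL_MS / 1000.0) / 1e9

print(f"pixels per frame : {pixels}")
print(f"bytes per remap  : {total_bytes / 1e6:.1f} MB")
print(f"implied bandwidth: {effective_gbps:.2f} GB/s "
      f"({100 * effective_gbps / PEAK_GBPS:.0f}% of the assumed peak)")
```

Under these assumptions the kernel moves about 220 MB per frame, or roughly 10 GB/s. Real traffic is likely higher with a heavily distorted map, since scattered reads waste part of each cache line, which would be consistent with the identity map running faster.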

Thanks.