Best remap implementation on Jetson Nano

Hello,

I need to undistort images as efficiently as possible, at 30 fps. Since the distortion is constant across frames, I precomputed a map, so that the destination image dst is computed from the source image src by dst[x,y] = src[map[x,y]] (in pseudo-code).
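In plain C, the per-pixel gather I mean is simply (nearest-neighbor, no interpolation; buffer names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[map[i]], where map stores a precomputed source index
 * per destination pixel (nearest-neighbor lookup, no interpolation). */
static void remap_u8(const uint8_t *src, uint8_t *dst,
                     const uint32_t *map, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; ++i)
        dst[i] = src[map[i]];
}
```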

I wrote my own CUDA kernel to do this, but it is slower than I expected.
On the other hand, I was told that OpenCV was too slow on Jetson Nano. From what I read in some forums, it sounds like a remapping function should use Texture Mapping Units, which I’m not familiar with.
What is the most efficient way to implement such a mapping function?
Thanks!

Hi,

Have you checked our VPI library?

https://docs.nvidia.com/vpi/1.2/sample_fisheye.html

Thanks.

Hi AastaLLL,

I confess I wasn’t aware it was available on Jetson Nano. Can it be safely used with GStreamer?

Thanks!

Hi,

Yes, below are some related samples for your reference:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#VPI

Thanks

Hi,

Thanks for the link. Unfortunately, it is still slower than my own implementation: 30 ms/frame to remap a ~14-megapixel RGBA8 image.
I was hoping for a great improvement when looking at the benchmarks on Jetson Orin, but I’m just running on a Jetson Nano… :-(

According to that benchmark, I could expect some improvement if I changed my image format to nv12_er, but I am having trouble redesigning my GStreamer plugin to do that (I’m not sure the nv12_er VPI format actually corresponds to GStreamer’s NV12 format).

Let me try with OpenCV at first…

Hi,

When you test VPI, could you help us check the GPU utilization as well?

$ sudo tegrastats

If the GPU is fully utilized, you should already reach the limit for Jetson Nano.

Thanks.

Hi AastaLLL,

Thanks for following up.
To complete my former message, OpenCV with CUDA was actually slower (34.7 ms/frame).

Note that, with any implementation, the GPU is not utilized at 100% when the algorithm is executed in my GStreamer pipeline (although the cameras produce frames faster than the pipeline can process them).

This said, here is an extract of the tegrastats output:

RAM 3257/3956MB (lfb 42x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [27%@1036,31%@1036,24%@1036,29%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 0%@921 APE 25 PLL@43C CPU@46C PMIC@50C GPU@43.5C AO@53C thermal@44.5C POM_5V_IN 5515/6132 POM_5V_GPU 1465/1821 POM_5V_CPU 713/862
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [36%@1326,40%@1326,30%@1326,26%@1326] EMC_FREQ 36%@1600 GR3D_FREQ 16%@921 APE 25 PLL@43.5C CPU@46C PMIC@50C GPU@43.5C AO@53C thermal@44.5C POM_5V_IN 6455/6149 POM_5V_GPU 1928/1826 POM_5V_CPU 944/867
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1326,27%@1326,28%@1326,22%@1326] EMC_FREQ 37%@1600 GR3D_FREQ 70%@921 APE 25 PLL@43.5C CPU@45.5C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 6494/6166 POM_5V_GPU 2247/1847 POM_5V_CPU 865/867
RAM 3257/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1036,26%@1036,23%@1036,25%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 20%@921 APE 25 PLL@43.5C CPU@45.5C PMIC@50C GPU@43.5C AO@53.5C thermal@44.5C POM_5V_IN 6081/6162 POM_5V_GPU 1658/1838 POM_5V_CPU 868/867
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [35%@1326,32%@1326,23%@1326,25%@1326] EMC_FREQ 37%@1600 GR3D_FREQ 83%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44C AO@53.5C thermal@44.5C POM_5V_IN 6533/6179 POM_5V_GPU 2125/1851 POM_5V_CPU 1062/875
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [44%@1326,33%@1479,23%@1479,24%@1479] EMC_FREQ 37%@1600 GR3D_FREQ 91%@921 APE 25 PLL@43.5C CPU@46.5C PMIC@50C GPU@44C AO@54C thermal@45C POM_5V_IN 6983/6214 POM_5V_GPU 2393/1875 POM_5V_CPU 1176/889
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [29%@1224,30%@1224,20%@1224,18%@1224] EMC_FREQ 36%@1600 GR3D_FREQ 0%@921 APE 25 PLL@43.5C CPU@46C PMIC@50C GPU@44C AO@53C thermal@45C POM_5V_IN 5744/6195 POM_5V_GPU 1426/1856 POM_5V_CPU 751/883
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1428,25%@1428,29%@1428,24%@1428] EMC_FREQ 36%@1600 GR3D_FREQ 55%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 6533/6208 POM_5V_GPU 2046/1864 POM_5V_CPU 903/884
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [30%@1036,31%@1036,30%@921,27%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 70%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44.5C AO@53.5C thermal@44.75C POM_5V_IN 6199/6208 POM_5V_GPU 1776/1860 POM_5V_CPU 908/884
RAM 3258/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@921,24%@921,28%@921,31%@921] EMC_FREQ 36%@1600 GR3D_FREQ 46%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@53.5C thermal@44.75C POM_5V_IN 5665/6188 POM_5V_GPU 1545/1849 POM_5V_CPU 713/878
RAM 3260/3956MB (lfb 41x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1036,19%@1036,33%@1036,26%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 82%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@54C thermal@45.25C POM_5V_IN 6120/6185 POM_5V_GPU 2132/1859 POM_5V_CPU 710/872
RAM 3261/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [36%@1132,27%@1132,32%@1132,27%@1132] EMC_FREQ 36%@1600 GR3D_FREQ 45%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44.5C AO@53.5C thermal@45.5C POM_5V_IN 5853/6174 POM_5V_GPU 1779/1856 POM_5V_CPU 791/869
RAM 3262/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [29%@1428,27%@1428,35%@1428,27%@1428] EMC_FREQ 36%@1600 GR3D_FREQ 81%@921 APE 25 PLL@44.5C CPU@46.5C PMIC@50C GPU@43.5C AO@54C thermal@45C POM_5V_IN 6680/6191 POM_5V_GPU 1964/1859 POM_5V_CPU 1100/877
RAM 3263/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [31%@1036,32%@1036,46%@1036,26%@1036] EMC_FREQ 36%@1600 GR3D_FREQ 50%@921 APE 25 PLL@44C CPU@46C PMIC@50C GPU@44C AO@53C thermal@45C POM_5V_IN 5972/6183 POM_5V_GPU 1582/1850 POM_5V_CPU 908/878
RAM 3265/3956MB (lfb 40x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [28%@1479,31%@1479,36%@1479,29%@1479] EMC_FREQ 37%@1600 GR3D_FREQ 97%@921 APE 25 PLL@44C CPU@47C PMIC@50C GPU@44C AO@54C thermal@45.25C POM_5V_IN 6455/6192 POM_5V_GPU 2046/1857 POM_5V_CPU 1141/886
RAM 3268/3956MB (lfb 39x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@921,28%@921,29%@921,27%@921] EMC_FREQ 37%@1600 GR3D_FREQ 4%@921 APE 25 PLL@44C CPU@46.5C PMIC@50C GPU@44.5C AO@53.5C thermal@45.5C POM_5V_IN 5814/6180 POM_5V_GPU 1582/1848 POM_5V_CPU 751/882
RAM 3270/3956MB (lfb 38x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@1326,27%@1326,25%@1326,26%@1326] EMC_FREQ 36%@1600 GR3D_FREQ 68%@921 APE 25 PLL@44.5C CPU@46.5C PMIC@50C GPU@44.5C AO@54C thermal@45.25C POM_5V_IN 6386/6186 POM_5V_GPU 1695/1844 POM_5V_CPU 865/882
RAM 3273/3956MB (lfb 37x4MB) SWAP 34/1978MB (cached 1MB) IRAM 0/252kB(lfb 252kB) CPU [32%@921,35%@921,39%@921,28%@921] EMC_FREQ 36%@1600 GR3D_FREQ 20%@921 APE 25 PLL@44.5C CPU@47C PMIC@50C GPU@44.5C AO@54C thermal@45.5C POM_5V_IN 5962/6180 POM_5V_GPU 1737/1841 POM_5V_CPU 829/880

As you can see, the actual GPU usage fluctuates a lot. This is not the case in my custom test app, but unfortunately, the algorithm has to run in the GStreamer context…

Hi,

It looks like there is still some headroom in GPU utilization.

From the tegrastats log, the memory consumption is pretty high.
If you want to further investigate, could you run the Nsight System profiling and share the output with us?

Thanks.

Hi AastaLLL,

Thanks for your follow-up.
Note that I could not find an Nsight Systems version that both supports my OS version (L4T 32.7.1) and runs remotely on a PC with a visual output.

Hence, here is nvprof’s text output. I interrupted the pipeline after processing 261 frames:

==24942== Profiling application: gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1 ! nvvidconv flip-method=2 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA,width=3024,height=2280 ! mix. nvarguscamerasrc sensor-
==24942== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:   75.90%  6.48994s       524  12.385ms  1.1864ms  33.381ms  cudaGraphicsUnregisterResource
                   17.99%  1.53827s       523  2.9412ms  1.3921ms  60.936ms  cudaGraphicsEGLRegisterImage
                    4.79%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel
                    0.36%  30.852ms       261  118.21us  71.876us  896.01us  cudaCreateTextureObject
                    0.34%  28.902ms       524  55.155us  35.938us  411.41us  cudaStreamSynchronize
                    0.15%  13.133ms       261  50.318us  15.208us  787.41us  cudaDestroyTextureObject
                    0.14%  11.877ms         1  11.877ms  11.877ms  11.877ms  cudaFree
                    0.10%  8.8913ms       260  34.197us  14.792us  965.64us  cudaProfilerStart
                    0.10%  8.3222ms       524  15.882us  4.9480us  338.18us  cudaGraphicsResourceGetMappedEglFrame
                    0.06%  5.0495ms      3217  1.5690us     469ns  120.52us  cudaGetLastError
                    0.06%  4.7432ms       524  9.0510us  3.9590us  71.251us  cudaPointerGetAttributes
                    0.01%  457.20us       261  1.7510us     781ns  107.45us  cudaCreateChannelDesc
                    0.00%  286.62us        65  4.4090us  2.9170us  57.918us  cudaEventDestroy
                    0.00%  79.429us         1  79.429us  79.429us  79.429us  cudaGetDeviceProperties
                    0.00%  33.595us         2  16.797us  15.157us  18.438us  cudaStreamDestroy
                    0.00%  4.4270us         1  4.4270us  4.4270us  4.4270us  cuDeviceGetCount
                    0.00%  1.0940us         1  1.0940us  1.0940us  1.0940us  cudaGetDeviceCount

==24942== NVTX result:
==24942==   Thread "<unnamed>" (id = 637055472)
==24942==     Domain "VPI"
==24942==       Range "sync cuda"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  17.132ms       261  65.639us  42.032us  300.68us  sync cuda
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiImageCreateEGLImageWrapper"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  74.907ms         2  37.453ms  37.176ms  37.731ms  vpiImageCreateEGLImageWrapper
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiImageSetWrappedEGLImage"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  170.28ms       520  327.45us  11.719us  12.747ms  vpiImageSetWrappedEGLImage
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiStreamSync"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  8.68383s       261  33.271ms  26.482ms  403.68ms  vpiStreamSync
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "vpiSubmitRemap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  17.884ms       261  68.521us  11.772us  11.511ms  vpiSubmitRemap
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==   Thread "<unnamed>" (id = 671085040)
==24942==     Domain "VPI"
==24942==       Range "Remap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  8.54852s       261  32.753ms  26.351ms  399.99ms  Remap
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel

==24942==       Range "dispatch"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  474.68ms       261  1.8187ms  218.49us  361.71ms  dispatch
 GPU activities:  100.00%  5.74110s       261  21.997ms  20.145ms  32.002ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  409.63ms       261  1.5695ms  90.835us  361.50ms  cudaLaunchKernel

==24942==       Range "map"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  1.53948s       261  5.8984ms  3.1450ms  96.829ms   map
No kernels were profiled in this range.
No API activities were profiled in this range.

==24942==       Range "unmap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.52813s       261  25.012ms  22.729ms  34.775ms  unmap
No kernels were profiled in this range.
No API activities were profiled in this range.

======== Error: Application returned non-zero code 2

Note that the input/output frames are 6048×2280 px in RGBA, and that the map is quite “distorted” in this case. I noticed the processing time could be reduced to about 25 ms/frame when using an identity map (probably because of better cache behavior), but of course the remap operation is useless in that case.

Thanks.

Hi,

Based on the profiling output, cudaGraphicsEGLRegisterImage/cudaGraphicsUnregisterResource are called about 523 times.

Have you tried vpiImageCreateNvBufferWrapper?
In our sample below, we only do this for the first frame and reuse the buffer for VPI.

https://elinux.org/Jetson/L4T/TRT_Customized_Example#VPI_with_Argus_Camera_-_nvarguscamerasrc

Thanks.

Hi,

You raise a very interesting point!
By moving the calls to cudaProfilerStart() and cudaProfilerStop(), I could see that cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource were actually called by vpiStreamSync after each call to vpiSubmitRemap.

My VPIImage instances were retrieved by calling NvEGLImageFromFd, then vpiImageSetWrappedEGLImage, which explains why VPI needed these cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource calls under the hood (once for the input and once for the output image at each frame, hence twice per frame).

I removed my EGL calls to use the following:

	/* Wrap the input dmabuf once; on later frames, just re-wrap the
	 * existing VPIImage with the current fd. */
	if (!client->vpi_image) {
		CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(in_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->vpi_image));
	}
	else {
		CHECK_VPI_STATUS(vpiImageSetWrappedNvBuffer(client->vpi_image, in_dmabuf_fd));
	}

	/* Same pattern for the output dmabuf. */
	if (!client->inter_img) {
		CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(out_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->inter_img));
	}
	else {
		CHECK_VPI_STATUS(vpiImageSetWrappedNvBuffer(client->inter_img, out_dmabuf_fd));
	}

	VPIInterpolationType interp = VPI_INTERP_NEAREST;

cudaProfilerStart();
	CHECK_VPI_STATUS(vpiSubmitRemap(client->vpi_stream, 0, client->warp, client->vpi_image /*input*/, client->inter_img /* output */, interp, VPI_BORDER_ZERO, 0));
	CHECK_VPI_STATUS(vpiStreamSync(client->vpi_stream));
cudaProfilerStop();

Yet, the same calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource still happen under the hood, and the time spent remains very similar. Note that I need to use the CUDA backend on Jetson Nano.
Below is the nvprof trace:

==12652== Profiling application: gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1 ! nvvidconv flip-method=2 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA,width=3024,height=2280 ! mix. nvarguscamerasrc sensor-
==12652== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:   71.84%  4.59489s       360  12.764ms  1.1739ms  55.696ms  cudaGraphicsUnregisterResource
                   18.97%  1.21350s       359  3.3802ms  1.4161ms  54.672ms  cudaGraphicsEGLRegisterImage
                    8.20%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel
                    0.34%  21.437ms       360  59.546us  35.157us  567.67us  cudaStreamSynchronize
                    0.29%  18.405ms       180  102.25us  65.835us  275.79us  cudaCreateTextureObject
                    0.15%  9.2965ms       180  51.647us  15.781us  654.23us  cudaDestroyTextureObject
                    0.09%  5.5484ms       360  15.412us  3.8020us  1.6712ms  cudaPointerGetAttributes
                    0.07%  4.4609ms       360  12.391us  4.9480us  64.740us  cudaGraphicsResourceGetMappedEglFrame
                    0.06%  3.6644ms      2160  1.6960us     521ns  115.47us  cudaGetLastError
                    0.00%  256.10us       180  1.4220us     677ns  27.605us  cudaCreateChannelDesc
                    0.00%  6.6660us         1  6.6660us  6.6660us  6.6660us  cuDeviceGetCount

==12652== NVTX result:
==12652==   Thread "<unnamed>" (id = 1089819120)
==12652==     Domain "VPI"
==12652==       Range "sync cuda"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  12.640ms       180  70.224us  42.553us  610.74us  sync cuda
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "vpiStreamSync"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.52808s       180  36.267ms  26.118ms  566.90ms  vpiStreamSync
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "vpiSubmitRemap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  5.4256ms       180  30.142us  15.573us  94.741us  vpiSubmitRemap
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==   Thread "<unnamed>" (id = 1115771376)
==12652==     Domain "VPI"
==12652==       Range "Remap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.45995s       180  35.889ms  25.943ms  566.67ms  Remap
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel

==12652==       Range "dispatch"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  578.08ms       180  3.2116ms  201.77us  506.31ms  dispatch
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel

==12652==       Range "map"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  1.25549s       180  6.9750ms  3.1349ms  89.987ms   map
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "unmap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  4.62163s       180  25.676ms  22.392ms  57.072ms  unmap
No kernels were profiled in this range.
No API activities were profiled in this range.

I wish I could save the time spent in cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource, but apparently there is no way.
Thanks in advance for any other idea to speed up my remap operation…

Hi,

You should only need to call the EGL register/unregister once.
But in the profiling data, it is called hundreds of times.

Could you check if this is necessary for your use case?
Thanks.

Hi AastaLLL,

I wish I could call them only once. But the difficulty is that my algorithm runs in the context of a GStreamer plugin, so each input (resp. output) frame is produced (resp. consumed) by the previous (resp. next) plugin in the pipeline, which may not use EGL.
Correct me if I’m wrong, but my understanding is that these EGL register/unregister calls are necessary to let CUDA access the input/output images, independently of how the neighboring plugins access them. Note that in the latest version of my VPI test, these EGL register/unregister calls are triggered by just these two lines:

	CHECK_VPI_STATUS(vpiSubmitRemap(client->vpi_stream, 0, client->warp, client->vpi_image /*input*/, client->inter_img /* output */, interp, VPI_BORDER_ZERO, 0));
	CHECK_VPI_STATUS(vpiStreamSync(client->vpi_stream));

Not sure if this is a good hint, but I noticed there is an nvegltransform GStreamer plugin. I tried to put it before mine in my pipeline, but unfortunately it uses a different src pad (of type video/x-raw(memory:EGLImage), not memory:NVMM), while my plugin was derived from nvcompositor and hence requires an input in memory:NVMM. I could find no use case of this plugin on the forum, so I abandoned this idea.

Another idea would be to write one plugin before mine to do the EGL register, and another after it to unregister, but then how can the intermediate plugin retrieve a CUeglFrame from a GStreamer frame id without calling cuGraphicsEGLRegisterImage, then cuGraphicsResourceGetMappedEglFrame?
Maybe the solution lies in the memory:EGLImage pad type mentioned above, but without an implementation example I don’t know how to handle that.

Any clarification on the above would surely help.
Thanks.

Hi,

Have you given it a try?

In most of our cases, the pre-/post-processing (which would be the plugin in your case) only modifies the pixel values.
The buffer pointer is reused, which is why the registration can be done only once.

In your case, can the frame pointer change?

Thanks.

Hi AastaLLL,

Sorry for the late reply.
Yes, the buffer id changes for both the input and output frames (there are 5 or 6 buffers used in turn).

I tried storing the resource and EGL frame returned by cuGraphicsEGLRegisterImage and cuGraphicsResourceGetMappedEglFrame in a dictionary and reusing them in subsequent calls. The result is that my CUDA kernel fails with error 719 (unspecified launch failure) as soon as I first reuse them.

As for the CUDA pointer to the image data, I noticed it could sometimes be the same as in former frames, yet point to different data (e.g. a pointer to the output data matching a former pointer to the input data). Hence, I did not store these pointers but rather the resource and EGL frame, as explained above.
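If I retry caching, I would key it on the dmabuf fd rather than on the CUDA pointer. A minimal sketch of the lookup I have in mind (plain C; the void* payload is a placeholder for whatever per-buffer state would be cached):

```c
#include <stddef.h>

/* Tiny fd -> payload lookup, keyed on the dmabuf fd rather than on the
 * CUDA pointer (which can be recycled across frames). POOL_MAX matches
 * the handful of buffers GStreamer cycles through. */
#define POOL_MAX 8

struct fd_cache {
    int    fd[POOL_MAX];
    void  *payload[POOL_MAX];
    size_t count;
};

/* Returns the (possibly new, NULL-initialized) payload slot for fd,
 * or NULL if the pool is full. */
static void **fd_cache_slot(struct fd_cache *c, int fd)
{
    for (size_t i = 0; i < c->count; ++i)
        if (c->fd[i] == fd)
            return &c->payload[i];
    if (c->count == POOL_MAX)
        return NULL;
    c->fd[c->count] = fd;
    c->payload[c->count] = NULL;
    return &c->payload[c->count++];
}
```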

My understanding is that the register operations are necessary for CUDA to index the memory it handles with its own pointers, but I then need to unregister so that other plugins (not CUDA-based, such as nvvidconv) can work on the frames.

As for writing custom plugins to call EGLRegister and then NvDestroyEGLImage before/after the processing itself, I’m not skilled enough in GStreamer to understand how I can retrieve the CUeglFrame pointer in the intermediate plugin (my plugin is currently a modified version of nvcompositor, in which I altered the processing function - was it a bad idea to do so?).
Do you know any use-cases using the nvegltransform plugin?

Thanks.

Hi,

If your buffer changes during the process, registering/unregistering for each frame is required.

nvegltransform is a DeepStream plugin, so you should find more info in the DeepStream documentation:
https://docs.nvidia.com/metropolis/deepstream/6.0.1/dev-guide/text/DS_Quickstart.html

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.