Best remap implementation on Jetson Nano

Hi,

You raise a very interesting point!
After moving the calls to cudaProfilerStart() and cudaProfilerStop(), I could see that cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource were actually called by vpiStreamSync, after the call to vpiSubmitRemap.

My VPIImage instances were actually obtained by calling NvEGLImageFromFd and then vpiImageSetWrappedEGLImage, which explains why VPI needed to call cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource under the hood. This happens for both the input and the output image on every frame, so each function is called twice per frame.

I removed my EGL calls and now use the following instead:

	/* Wrap the input dmabuf on the first frame, then only swap the wrapped buffer. */
	if (!client->vpi_image) {
		CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(in_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->vpi_image));
	}
	else {
		CHECK_VPI_STATUS(vpiImageSetWrappedNvBuffer(client->vpi_image, in_dmabuf_fd));
	}

	/* Same for the output dmabuf. */
	if (!client->inter_img) {
		CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(out_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->inter_img));
	}
	else {
		CHECK_VPI_STATUS(vpiImageSetWrappedNvBuffer(client->inter_img, out_dmabuf_fd));
	}

	VPIInterpolationType interp = VPI_INTERP_NEAREST;

	/* Profile only the remap submission and the sync that follows it. */
	cudaProfilerStart();
	CHECK_VPI_STATUS(vpiSubmitRemap(client->vpi_stream, 0, client->warp, client->vpi_image /* input */, client->inter_img /* output */, interp, VPI_BORDER_ZERO, 0));
	CHECK_VPI_STATUS(vpiStreamSync(client->vpi_stream));
	cudaProfilerStop();

Yet the same calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource are still made under the hood, and the time spent remains very similar. Note that I have to use the CUDA backend on the Jetson Nano.
Below is the nvprof trace:

==12652== Profiling application: gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1 ! nvvidconv flip-method=2 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA,width=3024,height=2280 ! mix. nvarguscamerasrc sensor-
==12652== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:   71.84%  4.59489s       360  12.764ms  1.1739ms  55.696ms  cudaGraphicsUnregisterResource
                   18.97%  1.21350s       359  3.3802ms  1.4161ms  54.672ms  cudaGraphicsEGLRegisterImage
                    8.20%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel
                    0.34%  21.437ms       360  59.546us  35.157us  567.67us  cudaStreamSynchronize
                    0.29%  18.405ms       180  102.25us  65.835us  275.79us  cudaCreateTextureObject
                    0.15%  9.2965ms       180  51.647us  15.781us  654.23us  cudaDestroyTextureObject
                    0.09%  5.5484ms       360  15.412us  3.8020us  1.6712ms  cudaPointerGetAttributes
                    0.07%  4.4609ms       360  12.391us  4.9480us  64.740us  cudaGraphicsResourceGetMappedEglFrame
                    0.06%  3.6644ms      2160  1.6960us     521ns  115.47us  cudaGetLastError
                    0.00%  256.10us       180  1.4220us     677ns  27.605us  cudaCreateChannelDesc
                    0.00%  6.6660us         1  6.6660us  6.6660us  6.6660us  cuDeviceGetCount

==12652== NVTX result:
==12652==   Thread "<unnamed>" (id = 1089819120)
==12652==     Domain "VPI"
==12652==       Range "sync cuda"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  12.640ms       180  70.224us  42.553us  610.74us  sync cuda
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "vpiStreamSync"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.52808s       180  36.267ms  26.118ms  566.90ms  vpiStreamSync
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "vpiSubmitRemap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  5.4256ms       180  30.142us  15.573us  94.741us  vpiSubmitRemap
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==   Thread "<unnamed>" (id = 1115771376)
==12652==     Domain "VPI"
==12652==       Range "Remap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  6.45995s       180  35.889ms  25.943ms  566.67ms  Remap
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel

==12652==       Range "dispatch"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  578.08ms       180  3.2116ms  201.77us  506.31ms  dispatch
 GPU activities:  100.00%  3.90658s       180  21.703ms  19.599ms  53.521ms  void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
      API calls:  100.00%  524.15ms       180  2.9120ms  84.220us  493.76ms  cudaLaunchKernel

==12652==       Range "map"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  1.25549s       180  6.9750ms  3.1349ms  89.987ms   map
No kernels were profiled in this range.
No API activities were profiled in this range.

==12652==       Range "unmap"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  4.62163s       180  25.676ms  22.392ms  57.072ms  unmap
No kernels were profiled in this range.
No API activities were profiled in this range.
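
As a quick sanity check of the averages above, here is a small C sketch that adds up the map, dispatch and unmap ranges and compares them with the Remap range and with the per-frame budget. The numbers are copied by hand from the "Avg" column of the trace. The struct and function names are only for illustration, and the 30 fps figure assumes a single stream at the framerate=30/1 shown in the gst-launch line.

```c
#include <stdio.h>

/* Per-frame averages in milliseconds, copied by hand from the "Avg" column
 * of the nvprof trace. All names here are illustrative, not from VPI. */
struct remap_avgs {
	double map_ms;      /* NVTX range "map"      */
	double dispatch_ms; /* NVTX range "dispatch" */
	double unmap_ms;    /* NVTX range "unmap"    */
	double remap_ms;    /* NVTX range "Remap"    */
	double kernel_ms;   /* gpuGeoTrans kernel    */
};

const struct remap_avgs TRACE_AVGS = {
	6.9750, 3.2116, 25.676, 35.889, 21.703
};

/* Sum of the three sub-ranges reported on the "Remap" thread. */
double stages_sum_ms(const struct remap_avgs *a)
{
	return a->map_ms + a->dispatch_ms + a->unmap_ms;
}

/* Fraction of the "Remap" range spent in map + unmap. */
double map_unmap_share(const struct remap_avgs *a)
{
	return (a->map_ms + a->unmap_ms) / a->remap_ms;
}

/* Time available per frame at a given frame rate (assumed: one stream). */
double frame_budget_ms(double fps)
{
	return 1000.0 / fps;
}

/* Print the breakdown next to the frame budget. */
void print_breakdown(const struct remap_avgs *a, double fps)
{
	printf("map + dispatch + unmap = %.3f ms (Remap range: %.3f ms)\n",
	       stages_sum_ms(a), a->remap_ms);
	printf("map + unmap share of Remap = %.1f %%\n",
	       100.0 * map_unmap_share(a));
	printf("kernel = %.3f ms, frame budget at %.0f fps = %.3f ms\n",
	       a->kernel_ms, fps, frame_budget_ms(fps));
}
```

Calling print_breakdown(&TRACE_AVGS, 30.0) from main shows that the three sub-ranges add up to about 35.86 ms, which matches the 35.889 ms Remap average, and that map plus unmap make up roughly 91% of it. That is above the 33.3 ms available per frame at 30 fps. Note that the unmap average (25.7 ms) is higher than the kernel average (21.7 ms), so part of the unmap time might simply be waiting for the kernel to finish; the trace alone does not tell.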

I wish I could save time on the calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource, but apparently there is no way to do so.
Thanks in advance for any other ideas to speed up my remap operation…