Hi,
You raise a very interesting point!
After moving the calls to cudaProfilerStart() and cudaProfilerStop(), I could see that cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource were actually called by vpiStreamSync after calling vpiSubmitRemap.
My VPIImage instances were actually obtained by calling NvEGLImageFromFd, then vpiImageSetWrappedEGLImage, which explains why VPI needed these calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource under the hood (once for the input image and once for the output image at each frame, hence about twice as many calls as frames).
I replaced my EGL calls with the following:
/* Wrap the input dmabuf on the first frame; afterwards, re-point the existing wrapper at the new fd. */
if (!client->vpi_image) {
    CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(in_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->vpi_image));
} else {
    vpiImageSetWrappedNvBuffer(client->vpi_image, in_dmabuf_fd);
}

/* Same for the output dmabuf. */
if (!client->inter_img) {
    CHECK_VPI_STATUS(vpiImageCreateNvBufferWrapper(out_dmabuf_fd, NULL, VPI_BACKEND_CUDA, &client->inter_img));
} else {
    vpiImageSetWrappedNvBuffer(client->inter_img, out_dmabuf_fd);
}

VPIInterpolationType interp = VPI_INTERP_NEAREST;

/* Profile only the remap submission and the stream sync. */
cudaProfilerStart();
CHECK_VPI_STATUS(vpiSubmitRemap(client->vpi_stream, 0, client->warp,
                                client->vpi_image /* input */,
                                client->inter_img /* output */,
                                interp, VPI_BORDER_ZERO, 0));
CHECK_VPI_STATUS(vpiStreamSync(client->vpi_stream));
cudaProfilerStop();
Yet the same calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource are still made under the hood, and the time spent remains very similar. Note that I need to use the CUDA backend on the Jetson Nano.
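One thing I have not tried yet: instead of re-pointing a single wrapper at a new fd every frame, keep one wrapper per dmabuf fd and reuse it when the same fd comes back. The sketch below is only the bookkeeping for that idea, with the wrapper stored as an opaque pointer. The names fd_cache, fd_cache_get, wrap_fn and counting_wrap are hypothetical, not VPI API. It assumes the buffer pool hands out a small recurring set of fds, and whether VPI would then skip the register/unregister pair is an assumption I have not verified.

```c
#include <assert.h>
#include <stddef.h>

#define FD_CACHE_CAPACITY 16

/* Hypothetical callback that creates a wrapper for a dmabuf fd.
 * In real code it could call vpiImageCreateNvBufferWrapper and
 * return the resulting VPIImage handle. Returns NULL on failure. */
typedef void *(*wrap_fn)(int fd, void *user);

struct fd_cache_entry {
    int   fd;
    void *wrapper;
};

struct fd_cache {
    struct fd_cache_entry entries[FD_CACHE_CAPACITY];
    size_t count;
};

static void fd_cache_init(struct fd_cache *c)
{
    c->count = 0;
}

/* Returns the wrapper cached for fd, creating it on first sight.
 * Returns NULL if creation fails or the cache is full. */
static void *fd_cache_get(struct fd_cache *c, int fd, wrap_fn create, void *user)
{
    assert(c != NULL && create != NULL);

    for (size_t i = 0; i < c->count; ++i) {
        if (c->entries[i].fd == fd)
            return c->entries[i].wrapper;   /* hit: no new wrapper */
    }
    if (c->count == FD_CACHE_CAPACITY)
        return NULL;

    void *wrapper = create(fd, user);
    if (wrapper == NULL)
        return NULL;

    c->entries[c->count].fd = fd;
    c->entries[c->count].wrapper = wrapper;
    c->count++;
    return wrapper;
}

/* Stand-in creator for illustration only: counts how many times a
 * wrapper is created and returns the counter address as the handle. */
static void *counting_wrap(int fd, void *user)
{
    (void)fd;
    int *calls = user;
    ++*calls;
    return user;
}
```

With this, asking for fds 7, 9, 7 in sequence creates only two wrappers. The cached wrappers would have to be destroyed when the pipeline stops, and the cache invalidated if an fd is closed and reused for a different buffer.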
Below is the nvprof trace:
==12652== Profiling application: gst-launch-1.0 -e nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1 ! nvvidconv flip-method=2 ! nvvidconv ! video/x-raw(memory:NVMM),format=RGBA,width=3024,height=2280 ! mix. nvarguscamerasrc sensor-
==12652== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 3.90658s 180 21.703ms 19.599ms 53.521ms void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
API calls: 71.84% 4.59489s 360 12.764ms 1.1739ms 55.696ms cudaGraphicsUnregisterResource
18.97% 1.21350s 359 3.3802ms 1.4161ms 54.672ms cudaGraphicsEGLRegisterImage
8.20% 524.15ms 180 2.9120ms 84.220us 493.76ms cudaLaunchKernel
0.34% 21.437ms 360 59.546us 35.157us 567.67us cudaStreamSynchronize
0.29% 18.405ms 180 102.25us 65.835us 275.79us cudaCreateTextureObject
0.15% 9.2965ms 180 51.647us 15.781us 654.23us cudaDestroyTextureObject
0.09% 5.5484ms 360 15.412us 3.8020us 1.6712ms cudaPointerGetAttributes
0.07% 4.4609ms 360 12.391us 4.9480us 64.740us cudaGraphicsResourceGetMappedEglFrame
0.06% 3.6644ms 2160 1.6960us 521ns 115.47us cudaGetLastError
0.00% 256.10us 180 1.4220us 677ns 27.605us cudaCreateChannelDesc
0.00% 6.6660us 1 6.6660us 6.6660us 6.6660us cuDeviceGetCount
==12652== NVTX result:
==12652== Thread "<unnamed>" (id = 1089819120)
==12652== Domain "VPI"
==12652== Range "sync cuda"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 12.640ms 180 70.224us 42.553us 610.74us sync cuda
No kernels were profiled in this range.
No API activities were profiled in this range.
==12652== Range "vpiStreamSync"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 6.52808s 180 36.267ms 26.118ms 566.90ms vpiStreamSync
No kernels were profiled in this range.
No API activities were profiled in this range.
==12652== Range "vpiSubmitRemap"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 5.4256ms 180 30.142us 15.573us 94.741us vpiSubmitRemap
No kernels were profiled in this range.
No API activities were profiled in this range.
==12652== Thread "<unnamed>" (id = 1115771376)
==12652== Domain "VPI"
==12652== Range "Remap"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 6.45995s 180 35.889ms 25.943ms 566.67ms Remap
GPU activities: 100.00% 3.90658s 180 21.703ms 19.599ms 53.521ms void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
API calls: 100.00% 524.15ms 180 2.9120ms 84.220us 493.76ms cudaLaunchKernel
==12652== Range "dispatch"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 578.08ms 180 3.2116ms 201.77us 506.31ms dispatch
GPU activities: 100.00% 3.90658s 180 21.703ms 19.599ms 53.521ms void nv::vpi::priv::_GLOBAL__N__40_tmpxft_00000e4b_00000000_6_remap_cpp1_ii_8a5b7425::gpuGeoTrans<nv::vpi::priv::QuadHandler<uchar4, VPIInterpolationType=1, nv::cuda::Texture2D<float4>, nv::cuda::IMem<uchar4, nv::cuda::Layout, int=2>>>(uchar4, uint2, nv::cuda::ConstMemRef<float2, nv::cuda::Layout, int=2>, VPIBorderExtension)
API calls: 100.00% 524.15ms 180 2.9120ms 84.220us 493.76ms cudaLaunchKernel
==12652== Range "map"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 1.25549s 180 6.9750ms 3.1349ms 89.987ms map
No kernels were profiled in this range.
No API activities were profiled in this range.
==12652== Range "unmap"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 4.62163s 180 25.676ms 22.392ms 57.072ms unmap
No kernels were profiled in this range.
No API activities were profiled in this range.
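To put the trace in per-frame terms, here is a small sketch of the arithmetic. The totals are copied from the "API calls" and "GPU activities" lines above; the function names are only for illustration. These are host-side API times, and the "unmap" range (25.676 ms on average against a 21.703 ms kernel) suggests part of the unregister time may be spent waiting for the kernel, so I would not simply add them to the kernel time.

```c
#include <assert.h>
#include <stdio.h>

/* Average per-frame cost in milliseconds, given a total in seconds. */
static double per_frame_ms(double total_s, int frames)
{
    assert(frames > 0);
    return total_s * 1000.0 / frames;
}

/* Prints the per-frame breakdown using the totals from the nvprof trace. */
static void print_breakdown(void)
{
    const int    frames       = 180;      /* number of gpuGeoTrans launches */
    const double unregister_s = 4.59489;  /* cudaGraphicsUnregisterResource total */
    const double register_s   = 1.21350;  /* cudaGraphicsEGLRegisterImage total */
    const double kernel_s     = 3.90658;  /* gpuGeoTrans total */

    printf("register + unregister: %.2f ms/frame\n",
           per_frame_ms(register_s + unregister_s, frames));
    printf("remap kernel:          %.2f ms/frame\n",
           per_frame_ms(kernel_s, frames));
}
```

Calling print_breakdown() from main shows roughly 32 ms per frame inside the register/unregister calls against roughly 22 ms for the remap kernel itself.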
I would like to cut the time spent in the calls to cudaGraphicsEGLRegisterImage and cudaGraphicsUnregisterResource, but so far I have not found a way.
Thanks in advance for any other ideas to save time in my remap operation…