Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU)
Jetson
• DeepStream Version
6.2
• JetPack Version (valid for Jetson only)
5.1.1
• TensorRT Version
8.5.2.2-1+cuda11.4
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
Questions
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for a new requirement. Include the module name (for which plugin or which sample application) and the function description.)
Is there a better (faster) way to get the surface's block linear frames into a pitch linear CUDA buffer? Currently, for each batch I map the surface with NvBufSurfaceMapEglImage, then for each stream I call cudaGraphicsEGLRegisterImage and cudaGraphicsResourceGetMappedEglFrame, copy the data with cudaMemcpy2DFromArrayAsync, and call cudaGraphicsUnregisterResource, and at the end of it all I call NvBufSurfaceUnMapEglImage. From my testing this can be very slow if we run inference at the same time: around 20 ms on average for 6 streams, with a minimum of 5 ms and a maximum of 37 ms. Without inference the min/max stay about the same, but the average drops close to the 5 ms minimum. This severely limits the frame rate we can achieve, because we need to cache the frame data and access it later. 62% of the total time is spent in cudaGraphicsUnregisterResource, and 16% in cudaGraphicsEGLRegisterImage.
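For reference, here is a trimmed sketch of that per-batch path (error handling removed; the destination buffers, pitch and stream are my own pre-allocated state, and I'm assuming NV12 NVBUF_MEM_SURFACE_ARRAY memory exposed as cudaEglFrameTypeArray):

#include <cuda_egl_interop.h>
#include <cuda_runtime.h>
#include <nvbufsurface.h>

// Sketch of the current per-batch copy from block linear surfaces into
// pre-allocated pitch linear device buffers (dstY/dstUV, one pair per stream).
void copyBatchToPitchLinear(NvBufSurface *surf, unsigned char **dstY,
                            unsigned char **dstUV, size_t dstPitch,
                            cudaStream_t stream) {
  NvBufSurfaceMapEglImage(surf, -1);  // map every buffer in the batch
  for (unsigned int i = 0; i < surf->numFilled; ++i) {
    NvBufSurfaceParams &p = surf->surfaceList[i];
    cudaGraphicsResource_t res = nullptr;
    cudaGraphicsEGLRegisterImage(&res, p.mappedAddr.eglImage,
                                 cudaGraphicsRegisterFlagsReadOnly);  // ~16% of the time
    cudaEglFrame frame;
    cudaGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);
    // Block linear planes are exposed as cudaArray objects (Y + interleaved UV).
    cudaMemcpy2DFromArrayAsync(dstY[i], dstPitch, frame.frame.pArray[0], 0, 0,
                               p.planeParams.width[0] * p.planeParams.bytesPerPix[0],
                               p.planeParams.height[0], cudaMemcpyDeviceToDevice, stream);
    cudaMemcpy2DFromArrayAsync(dstUV[i], dstPitch, frame.frame.pArray[1], 0, 0,
                               p.planeParams.width[1] * p.planeParams.bytesPerPix[1],
                               p.planeParams.height[1], cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaGraphicsUnregisterResource(res);  // ~62% of the total time is spent here
  }
  NvBufSurfaceUnMapEglImage(surf, -1);
}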
Hi @junshengy, my aim is to be able to convert it to an OpenCV Mat on demand. I don’t technically need to convert it to pitch linear at the time of buffering the frames. I only need to do it when I eventually want to convert it to an OpenCV Mat. But I don’t need to do that for every frame so in theory I could buffer block linear memory and only convert it when required.
Basically what we're doing is buffering all the frames, and when we detect an event we go back in history and request an OpenCV Mat for those time points.
Would the most efficient approach be to allocate a surface with NvBufSurfaceCreate and copy the current batch into it with NvBufSurfaceCopy, and then, at the time the OpenCV Mat is requested, convert block linear to pitch linear and copy it into an OpenCV buffer?
If you only want to convert it to an OpenCV Mat on demand, here is some sample code to do it.
#include <opencv2/imgproc.hpp>

#include "nvbufsurface.h"

int videoBlur(int fd) {
  NvBufSurface *surface = nullptr;
  if (NvBufSurfaceFromFd(fd, (void **)(&surface)) < 0) {
    ERROR_MSG("NvBufSurfaceFromFd failed");
    return -1;
  }
  cv::Mat in_mat;
  cv::Rect crop_rect;
  /* Map the buffer so that it can be accessed by the CPU */
  if (surface->surfaceList[0].mappedAddr.addr[0] == NULL) {
    if (NvBufSurfaceMap(surface, 0, 0, NVBUF_MAP_READ_WRITE) != 0) {
      ERROR_MSG("Map for CPU access failed");
      return -1;
    }
  }
  /* Invalidate the cache before the CPU accesses the buffer */
  if (surface->memType == NVBUF_MEM_SURFACE_ARRAY) {
    NvBufSurfaceSyncForCpu(surface, 0, 0);
  }
  /* Map to cv::Mat with the CPU address/width/height/stride of the video frame.
   * Map to CV_8UC4 as the video frame format is RGBA.
   */
  in_mat = cv::Mat(surface->surfaceList[0].planeParams.height[0],
                   surface->surfaceList[0].planeParams.width[0], CV_8UC4,
                   surface->surfaceList[0].mappedAddr.addr[0],
                   surface->surfaceList[0].planeParams.pitch[0]);
  /* Apply blur on the top-left quadrant of the image */
  crop_rect = cv::Rect(0, 0, surface->surfaceList[0].planeParams.width[0] / 2,
                       surface->surfaceList[0].planeParams.height[0] / 2);
  /* Apply gaussian blur */
  cv::GaussianBlur(in_mat(crop_rect), in_mat(crop_rect), cv::Size(15, 15), 4);
  /* Flush the cache after the CPU has accessed the buffer so that hardware
   * devices can access the buffer
   */
  if (surface->memType == NVBUF_MEM_SURFACE_ARRAY) {
    NvBufSurfaceSyncForDevice(surface, 0, 0);
  }
  /* UnMap the buffer */
  if (NvBufSurfaceUnMap(surface, 0, 0)) {
    ERROR_MSG("UnMap failed");
    return -1;
  }
  return 0;
}
You can get the fd from NvBufSurface->surfaceList[batch_index].bufferDesc.
The above example handles the conversion from block linear to pitch linear.
Remapping the hardware buffer should work for your needs.
Hi @junshengy, what I meant by "on-demand" is that I want to be able to get a cached frame from X seconds ago (not the current batch). The cache doesn't have to be a cv::Mat; it can be any data type that can eventually be converted to a cv::Mat, because I cache far more frames than I convert.
I used the code you suggested above a long time ago, but it’s much slower than my current approach.
No, we already use smart recording, but it’s not what we want in this scenario. I’m probably not explaining myself very well.
We need to be able to get an image at the start of the event. But we don’t know that the event has occurred until later (say for example at the end of a track). In order to be able to do this we need to cache all frames. We can then go back in time and select one or more frames from the event. This is very fast on dGPU but it’s slow on Jetson.
We keep a circular buffer of pre-allocated cv::GpuMat memory. On dGPU all we need to do is call nppiNV12ToBGR_8u_P2C3R_Ctx to convert the NV12 surface to cv::GpuMat BGR memory. Because Jetson is using block linear layout we have to do a lot of transformations. On Jetson we have to map the NV12 surface to an EGLImage and then copy this data into CUDA memory which can then be used to call nppiNV12ToBGR_8u_P2C3R_Ctx. These extra steps have a large overhead.
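To illustrate, this is roughly all the dGPU path has to do per frame (a sketch, assuming a pitch linear NV12 NvBufSurfaceParams with a device-accessible dataPtr and a pre-allocated CV_8UC3 GpuMat from our circular buffer; "ctx" wraps our CUDA stream):

#include <nppi_color_conversion.h>
#include <opencv2/core/cuda.hpp>

#include "nvbufsurface.h"

// Sketch of the dGPU conversion: one NPP call from pitch linear NV12 straight
// into the cached cv::cuda::GpuMat (BGR).
void nv12SurfaceToGpuMat(const NvBufSurfaceParams &p, cv::cuda::GpuMat &bgr,
                         NppStreamContext &ctx) {
  const Npp8u *src[2] = {
      (const Npp8u *)p.dataPtr,                            // Y plane
      (const Npp8u *)p.dataPtr + p.planeParams.offset[1]   // interleaved UV plane
  };
  NppiSize roi = {(int)p.width, (int)p.height};
  nppiNV12ToBGR_8u_P2C3R_Ctx(src, (int)p.planeParams.pitch[0], bgr.ptr<Npp8u>(),
                             (int)bgr.step, roi, ctx);
}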
We need to always cache all frames because we never know when things will happen, but most of these frames will just be discarded because nothing happened. So an acceptable workaround would be to cache a more Jetson-friendly surface format and only convert frames when they're needed. When I find some time I'll try to create a copy of the NvBufSurface itself and cache that.
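Something like the following is what I have in mind for the cache slots (a sketch only; I haven't verified these create parameters, and the helper name is mine):

#include "nvbufsurface.h"

// Sketch: allocate a cache slot that matches the incoming batch so the block
// linear frames can be copied as-is and converted to pitch linear only later.
NvBufSurface *allocCacheSlot(const NvBufSurface *src) {
  NvBufSurfaceCreateParams params = {};
  params.gpuId = src->gpuId;
  params.width = src->surfaceList[0].width;
  params.height = src->surfaceList[0].height;
  params.colorFormat = src->surfaceList[0].colorFormat;
  params.layout = src->surfaceList[0].layout;   // keep block linear on Jetson
  params.memType = src->memType;                // NVBUF_MEM_SURFACE_ARRAY here
  NvBufSurface *dst = nullptr;
  if (NvBufSurfaceCreate(&dst, src->batchSize, &params) != 0)
    return nullptr;
  return dst;
}

// Per batch: NvBufSurfaceCopy(batchSurface, cacheSlot); convert on demand later.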
There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks
Using a CUDA kernel is the fastest way to convert the NV12 surface to cv::GpuMat BGR memory on Jetson.
However, we don't currently have a sample that does this.
You may need to refer to the CUDA documentation.
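For illustration only (this is not an official sample), an NV12 to packed BGR kernel might look like the sketch below. It assumes the Y and interleaved UV planes are already reachable as pitch linear device pointers, for example after the EGL mapping step described earlier, and uses BT.601 full-range coefficients; adjust the color math to match your source.

#include <cuda_runtime.h>

// Illustrative NV12 (pitch linear Y + interleaved UV) -> packed BGR kernel.
__global__ void nv12ToBgrKernel(const unsigned char *srcY, size_t yPitch,
                                const unsigned char *srcUV, size_t uvPitch,
                                unsigned char *dstBgr, size_t bgrPitch,
                                int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;

  float Y = srcY[y * yPitch + x];
  const unsigned char *uv = srcUV + (y / 2) * uvPitch + (x / 2) * 2;
  float U = uv[0] - 128.f;
  float V = uv[1] - 128.f;

  // BT.601 full-range YUV -> RGB (swap in your coefficients if needed).
  float B = Y + 1.772f * U;
  float G = Y - 0.344136f * U - 0.714136f * V;
  float R = Y + 1.402f * V;

  unsigned char *out = dstBgr + y * bgrPitch + x * 3;
  out[0] = (unsigned char)fminf(fmaxf(B, 0.f), 255.f);
  out[1] = (unsigned char)fminf(fmaxf(G, 0.f), 255.f);
  out[2] = (unsigned char)fminf(fmaxf(R, 0.f), 255.f);
}

// Launched over the frame, writing directly into the GpuMat's device buffer:
//   dim3 block(32, 8);
//   dim3 grid((width + 31) / 32, (height + 7) / 8);
//   nv12ToBgrKernel<<<grid, block, 0, stream>>>(yPtr, yPitch, uvPtr, uvPitch,
//                                               mat.ptr(), mat.step, width, height);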