How to copy a GPU buffer (from nppiMalloc_8u_C4) to a dmabuf_fd (from NvBufferCreateEx)


Here is how I copy it, but it is too slow. Is there a way to do a device-to-device copy?

init() and process() work fine with one camera (60 fps); process() takes only 2-3 ms.
But when I use 4 cameras (each at 60 fps), process() takes 19 ms.
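
To put these timings in perspective, here is a rough back-of-the-envelope sketch. It assumes 2448x2048 frames with 4 bytes per RGBA pixel (the sizes used in the copy calls in this thread). The constant and function names are made up for illustration, and it only counts payload bytes, so it says nothing about what else process() spends time on.

```cpp
#include <cstdint>

// Figures taken from this thread: 2448x2048 RGBA frames, 60 fps, 4 cameras.
constexpr std::uint64_t kWidth = 2448;
constexpr std::uint64_t kHeight = 2048;
constexpr std::uint64_t kBytesPerPixel = 4;  // RGBA, 8 bits per channel
constexpr std::uint64_t kFps = 60;
constexpr std::uint64_t kCameras = 4;

// Payload bytes in one RGBA frame (row padding excluded).
constexpr std::uint64_t frame_bytes() {
    return kWidth * kHeight * kBytesPerPixel;
}

// Payload bytes copied per second for all cameras together.
constexpr std::uint64_t total_bytes_per_second() {
    return frame_bytes() * kFps * kCameras;
}

// Time available for one frame at kFps, in milliseconds.
constexpr double frame_budget_ms() {
    return 1000.0 / static_cast<double>(kFps);
}
```

Under these assumptions one frame is 20,054,016 bytes, the four cameras together need about 4.8 GB/s of copy throughput, and the per-frame budget at 60 fps is about 16.7 ms, so a 19 ms process() no longer fits in one frame period.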


NvBufferGetParams(m_argb_dmabuf_fd,&m_argb_parm);//pitch 9984
NvBufferMemMap(m_argb_dmabuf_fd, 0, NvBufferMem_Read_Write, &m_argb_dmabuf_buffer);

vesc_bayer_dev_ = nppiMalloc_8u_C1(CAMERA_IMAGE_WIDTH, CAMERA_IMAGE_HEIGHT, &resv_bayer_step_);//pitch 2560
resv_rgba_dev_ = nppiMalloc_8u_C4(CAMERA_IMAGE_WIDTH, CAMERA_IMAGE_HEIGHT, &resv_rgba_step_);  //pitch 10240



NvBufferMemSyncForCpu(m_argb_dmabuf_fd, 0, &m_argb_dmabuf_buffer);

cudaMemcpy2D(m_argb_dmabuf_buffer, m_argb_parm.pitch[0], resv_rgba_dev_, 2448*4, 2448*4, 2048, cudaMemcpyDeviceToHost);

NvBufferMemSyncForDevice(m_argb_dmabuf_fd, 0, &m_argb_dmabuf_buffer);


I also tried the code below, but it doesn't work.
cudaMemcpy2D(m_argb_parm.nv_buffer, m_argb_parm.pitch[0], resv_rgba_dev_, 2448*4, 2448*4, 2048, cudaMemcpyDeviceToDevice);


m_argb_dmabuf_fd looks like a CPU pinned buffer to me.
Could you give cudaMemcpyDeviceToHost a try?

If it is still not working, would you mind providing a simple reproducible sample for us to check?

Hi afhel,

Have you managed to get it working? Are there any results you can share?


No, it is not working.
The Bayer GB image from the camera is 2448 pixels wide, but the pitch of the NPP allocation is 2560 bytes.
And the pitch of the RGBA image in NPP is 10240 bytes, but in m_argb_dmabuf_buffer it is 9984 bytes.

So I have to copy them line by line, using cudaMemcpy2D.

cudaMemcpy2D takes a lot of time.
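
For reference, here is a minimal host-only sketch of how the pitch arguments of a 2D copy relate to the sizes above. copy_pitched is a hypothetical helper written for illustration, not a CUDA or NPP function. It takes its pitch, width and height arguments in the same order as cudaMemcpy2D, so it shows that buffers with different pitches (10240 and 9984 bytes here) can be handled by passing each pitch separately, with the row payload (2448*4 bytes) as the width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-only helper (not a CUDA API). Copies `height` rows of
// `width_bytes` bytes; rows are `src_pitch` bytes apart in the source and
// `dst_pitch` bytes apart in the destination.
void copy_pitched(std::uint8_t* dst, std::size_t dst_pitch,
                  const std::uint8_t* src, std::size_t src_pitch,
                  std::size_t width_bytes, std::size_t height) {
    // The payload of a row must fit inside both pitches.
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dst_pitch, src + row * src_pitch, width_bytes);
    }
}

// Sizes reported in this thread.
constexpr std::size_t kRowBytes = 2448 * 4;  // RGBA payload per row: 9792 bytes
constexpr std::size_t kSrcPitch = 10240;     // pitch from nppiMalloc_8u_C4
constexpr std::size_t kDstPitch = 9984;      // pitch from NvBufferGetParams
constexpr std::size_t kHeight = 2048;

// Fills a source image with a test pattern, copies it, and checks that every
// payload byte arrived and that the destination row padding was left untouched.
bool demo_copy_matches() {
    std::vector<std::uint8_t> src(kSrcPitch * kHeight, 0);
    std::vector<std::uint8_t> dst(kDstPitch * kHeight, 0xFF);
    for (std::size_t row = 0; row < kHeight; ++row) {
        for (std::size_t x = 0; x < kRowBytes; ++x) {
            src[row * kSrcPitch + x] = static_cast<std::uint8_t>((row * 31 + x) & 0xFF);
        }
    }
    copy_pitched(dst.data(), kDstPitch, src.data(), kSrcPitch, kRowBytes, kHeight);
    for (std::size_t row = 0; row < kHeight; ++row) {
        for (std::size_t x = 0; x < kDstPitch; ++x) {
            const std::uint8_t expected =
                x < kRowBytes ? static_cast<std::uint8_t>((row * 31 + x) & 0xFF) : 0xFF;
            if (dst[row * kDstPitch + x] != expected) return false;
        }
    }
    return true;
}
```

Mapped onto cudaMemcpy2D, this would mean passing m_argb_parm.pitch[0] (9984) as the destination pitch, resv_rgba_step_ (10240) as the source pitch, 2448*4 as the width in bytes and 2048 as the height, in a single call. This sketch only illustrates the pitch arithmetic; it does not address whether the dmabuf mapping can be used as a device pointer.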


I will provide a simple sample in a few days.


Thanks for the update.
A simple reproducible source will definitely help us figure out this issue.