Allocate CUDA host memory and copy NVBuffer Image into it

Hi Folks,

I intend to perform some basic operations on the GPU. I followed a code snippet that makes the operation work on one-dimensional float arrays. However, I would like to perform a similar operation on an image captured from the camera. I am currently reading frames from the camera using the following code:

UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);
        if (!iFrame)
            break;

        // Get the Frame's Image.
        Image *image = iFrame->getImage();
        EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
              = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
        TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

        int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
               NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);
        TEST_ERROR_RETURN(status != STATUS_OK, "Failed to create a native buffer");

 #if 1

        NvBufferParams params;
        NvBufferGetParams(fd, &params);

        char *data_mem = NULL;
        int size = m_framesize.width * m_framesize.height;

        int fsize = params.pitch[0] * m_framesize.height;
        data_mem = (char*)mmap(NULL, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, params.offset[0]);

I have been successful in extracting the Y-channel image, which I intend to use for my operation. However, I would like to use the “Zero Copy” capability of the TX1 and map the CPU memory pointer to a GPU memory pointer, which I am finding hard to accomplish using cudaHostAlloc.

Is there an easy way to map the CPU memory pointer (data_mem) to a GPU memory pointer that can be used to process the frame in CUDA code? Should I use cudaMemcpy to first copy the frame buffer from data_mem and then perform the operations? Is there another way to perform the operations on the frame without any copy, so as to minimize run time?



We have a zero-copy sample located at ‘/home/ubuntu/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy’.

cudaHostRegister() can register a CPU memory pointer with CUDA, but it is not supported on the ARM platform.
An alternative may be unified memory. Could you check whether unified memory solves your problem?


Hi AastaLLL,

Thanks for the reply. I tried using unified memory to allocate the input and output arrays. However, I am having difficulty converting the character buffer derived from the NvBuffer into a float buffer that can be used for GPU operations. Can you suggest a way to read an input buffer as a float buffer?
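
As a rough sketch of the char-to-float conversion asked about here: the mapped plane is a byte buffer whose rows sit `pitch` bytes apart, so each row is read using the pitch as the stride and each byte is widened to a float. The helper names `planeToFloat` and `planeToFloatVector` are hypothetical and not part of any NVIDIA API. The destination pointer is assumed to be any writable float buffer of `width * height` elements (for instance one obtained from unified memory, which is not shown).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an NVIDIA API): convert an 8-bit pitch-linear
// plane into a tightly packed float buffer. `pitch` is the row stride in
// bytes and may exceed `width` because of alignment padding.
void planeToFloat(const unsigned char *src, int width, int height, int pitch,
                  float *dst)
{
    assert(src && dst && width > 0 && height > 0 && pitch >= width);
    for (int y = 0; y < height; ++y) {
        // Step through the source by pitch, through the destination by width.
        const unsigned char *row = src + static_cast<std::size_t>(y) * pitch;
        float *out = dst + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x)
            out[x] = static_cast<float>(row[x]);  // values stay in 0..255
    }
}

// Convenience wrapper that owns the output on the CPU side.
std::vector<float> planeToFloatVector(const unsigned char *src, int width,
                                      int height, int pitch)
{
    assert(width > 0 && height > 0);
    std::vector<float> out(static_cast<std::size_t>(width) * height);
    planeToFloat(src, width, height, pitch, out.data());
    return out;
}
```

With the variables from the first post, a call might look like planeToFloat(reinterpret_cast<const unsigned char *>(data_mem), m_framesize.width, m_framesize.height, params.pitch[0], dst), where dst is the assumed float buffer. The cast to unsigned char matters because the signedness of plain char is implementation-defined, and pixel values above 127 must not turn negative.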



Try modifying the following line in '/home/ubuntu/tegra_multimedia_api/samples/common/classes/NvBuffer.cpp':

planes[j].data = new unsigned char [planes[j].length];

Hi AastaLLL,

I tried using unified memory by replacing this statement in NvBuffer::allocateMemory() with the following:

cudaMallocManaged(&planes[j].data, planes[j].length);

However, there seems to be no change. Even if I comment out the line and allocate no memory at all, the behavior stays the same. While building, I can see NvBuffer.cpp being rebuilt.

Does the following call invoke NvBuffer::allocateMemory()?

int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
               NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);

Could you please let me know where I am going wrong? Kindly help me out.



Sorry for the late response.

Creation may go through NvBuffer’s constructor rather than the allocateMemory() function.
Please check this by adding log prints.
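
A minimal sketch of that kind of log printing, assuming nothing about the real NvBuffer class: `TracedBuffer`, `trace` and `g_trace` are hypothetical stand-ins used only to show where the prints would go. Each entry point reports itself, so it becomes visible which of them a given creation path actually reaches.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Names of the functions that were entered, in call order.
static std::vector<std::string> g_trace;

// Record the call and print it to stderr so it shows up in the console log.
static void trace(const char *where)
{
    g_trace.push_back(where);
    std::fprintf(stderr, "[trace] entered %s\n", where);
}

// Hypothetical stand-in class; this is NOT the real NvBuffer.
class TracedBuffer {
public:
    TracedBuffer() { trace("TracedBuffer::TracedBuffer"); }
    void allocateMemory() { trace("TracedBuffer::allocateMemory"); }
};
```

If only the constructor line shows up in the log when a buffer is created, an edit inside allocateMemory() cannot affect that path.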