Allocate CUDA host memory and copy NvBuffer image into it

Hi Folks,

I want to perform some basic operations on the GPU. By following a code snippet, I got the operation working on one-dimensional float arrays. Now I would like to perform a similar operation on an image captured from the camera. I currently read frames from the camera with the following code:

UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
IFrame *iFrame = interface_cast<IFrame>(frame);
if (!iFrame)
    break;

// Get the Frame's Image.
Image *image = iFrame->getImage();
EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
    = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

// Copy the image into a new pitch-linear YUV420 NvBuffer; the return value is its dmabuf fd.
int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
        NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);
TEST_ERROR_RETURN(status != STATUS_OK, "Failed to create a native buffer");

#if 1

NvBufferParams params;
TEST_ERROR_RETURN(NvBufferGetParams(fd, &params) != 0, "Failed to get NvBuffer params");

// Plane 0 is the Y (luma) plane. Each row occupies pitch[0] bytes (pitch >= width),
// so the mapping must cover pitch[0] * height bytes, not width * height.
int fsize = params.pitch[0] * m_framesize.height;
char *data_mem = (char*)mmap(NULL, fsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, params.offset[0]);
TEST_ERROR_RETURN(data_mem == MAP_FAILED, "Failed to mmap the Y plane");

This successfully extracts the Y-channel image that I want to use for my operation. However, I would like to use the zero-copy capability of the TX1 and map the CPU memory pointer to a GPU memory pointer, which I am finding hard to do with cudaHostAlloc.
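Because the buffer is created with `NvBufferLayout_Pitch`, each row of the mapped Y plane occupies `params.pitch[0]` bytes, of which only the first `width` bytes are pixels; that is why the mapping covers `pitch[0] * height` bytes. A minimal sketch of extracting that plane into a tightly packed buffer, assuming `src` points at the start of plane 0 (for example `data_mem` above) and that `pitch >= width`. The helper name `copy_y_plane` is hypothetical and not part of the multimedia API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy a pitch-linear 8-bit plane (rows of `pitch` bytes, the first `width`
// of which are pixels) into a tightly packed width * height buffer.
// Hypothetical helper; `src` would be e.g. the mmap'ed Y plane.
std::vector<uint8_t> copy_y_plane(const uint8_t *src, int width, int height, int pitch)
{
    assert(pitch >= width);
    std::vector<uint8_t> dst(static_cast<std::size_t>(width) * height);
    for (int row = 0; row < height; ++row) {
        // Step the source by the pitch and the destination by the width,
        // which drops the padding bytes at the end of each source row.
        std::memcpy(dst.data() + static_cast<std::size_t>(row) * width,
                    src + static_cast<std::size_t>(row) * pitch,
                    static_cast<std::size_t>(width));
    }
    return dst;
}
```

For a 3x2 plane stored with a pitch of 4, `copy_y_plane(plane, 3, 2, 4)` drops the padding byte at the end of each row and returns 6 bytes. A kernel that indexes pixels as `y * pitch + x` can skip this copy entirely.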

Is there an easy way to map the CPU memory pointer (data_mem) to a GPU memory pointer that can be used to process the frame in CUDA code? Should I use cudaMemcpy to first copy the frame buffer from data_mem and then perform the operations? Or is there a way to operate on the frame without any copy, to minimize run time?

Thanks.

Hi,

We have a zero-copy sample at ‘/home/ubuntu/NVIDIA_CUDA-8.0_Samples/0_Simple/simpleZeroCopy’.

cudaHostRegister() can register an existing CPU memory pointer with CUDA, but it is not supported on the ARM platform.
An alternative may be unified memory. Could you check whether unified memory solves your problem?
https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/

Thanks.

Hi AastaLLL,

Thanks for the reply. I tried using unified memory to allocate the input and output arrays. However, I am having difficulty converting the character buffer obtained from the NvBuffer into a float buffer that can be used for GPU operations. Can you suggest a way to read an input buffer as a float buffer?
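One way to go from the 8-bit Y plane to a float buffer is to convert each pixel while honoring the pitch. A minimal CPU sketch, assuming the same pitch-linear layout as the snippet earlier in the thread; the helper name `y_plane_to_float` and the [0, 1] scaling are illustrative choices, not something the API requires. Reading the pixels through `uint8_t` rather than plain `char` matters because `char` is signed on some platforms, where values above 127 would turn negative. `dst` could be a `cudaMallocManaged` allocation of `width * height` floats; on the TX1 the CPU should only touch managed memory while no kernel is running (for example after `cudaDeviceSynchronize()`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert a pitch-linear 8-bit plane (rows of `pitch` bytes, `width` valid
// pixels each) into a packed float buffer of width * height values in [0, 1].
// Hypothetical helper. In a CUDA kernel, the body of the inner loop would run
// once per thread, with x and y derived from blockIdx/threadIdx.
void y_plane_to_float(const uint8_t *src, int width, int height, int pitch,
                      float *dst)
{
    assert(pitch >= width);
    for (int y = 0; y < height; ++y) {
        const uint8_t *row = src + static_cast<std::size_t>(y) * pitch;
        for (int x = 0; x < width; ++x)
            dst[static_cast<std::size_t>(y) * width + x] = row[x] / 255.0f;
    }
}
```

With `data_mem` from the earlier snippet, the call would look like `y_plane_to_float(reinterpret_cast<const uint8_t *>(data_mem), width, height, params.pitch[0], dst)`.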

Thanks

Hi,

Try modifying this line

in '/home/ubuntu/tegra_multimedia_api/samples/common/classes/NvBuffer.cpp':

planes[j].data = new unsigned char [planes[j].length];

Hi AastaLLL,

I tried allocating unified memory by replacing that statement in NvBuffer::allocateMemory() with the following:

cudaSetDeviceFlags(cudaDeviceMapHost);
cudaMallocManaged(&planes[j].data, planes[j].length);

However, nothing changes. Even if I comment out the line and allocate no memory at all, the behavior stays the same, although I can see NvBuffer.cpp being rebuilt.

Does the following call invoke NvBuffer::allocateMemory()?

int fd = iImageNativeBuffer->createNvBuffer(Argus::Size {m_framesize.width, m_framesize.height},
               NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);

Could you please let me know where I am going wrong? Any help is appreciated.

Thanks.

Hi,

Sorry for the late response.

Buffer creation may go through NvBuffer’s constructor rather than the allocateMemory() function.
Please confirm this by adding log prints.

Thanks.