Encode frames using v4l2 and NvBufSurface

My system:

  • Jetson Xavier NX
  • Jetson Linux 35.2.1

For my application I have some frames in CUDA memory that I want to encode. Using the MMAPI video encoding sample, I could encode my video correctly.
However, according to the sample, before enqueuing the frames with v4l2_buf, I have to transfer them to an NvBuffer* array, which is on the CPU, so technically I need a slow copy between GPU and CPU.
I also checked the possibility of using NvBufSurface directly (without NvBuffer), but that also seems to be a CPU-accessible structure (NvBufSurfaceMappedAddr is described as “holds planewise pointers to a CPU mapped buffer”).
Now my question is: how should I transfer my data to v4l2 without touching CPU memory, relying only on a DeviceToDevice copy? My data is a YUV image kept in 3 separate uchar* arrays.
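
To make the three-array layout concrete, here is a minimal sketch of the per-plane sizes. It assumes planar YUV 4:2:0 with 8 bits per sample and even frame dimensions, which the post does not state. PlaneDims, yuv420PlaneDims and planeBytes are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one tightly packed plane (no row padding).
struct PlaneDims {
    std::uint32_t width;   // bytes per row (1 byte per sample)
    std::uint32_t height;  // number of rows
};

// Assumption: planar YUV 4:2:0, 8-bit samples, even width and height.
// Plane 0 is Y at full resolution; planes 1 and 2 are U and V,
// each subsampled by 2 horizontally and vertically.
inline PlaneDims yuv420PlaneDims(std::uint32_t frameWidth,
                                 std::uint32_t frameHeight, int plane) {
    assert(plane >= 0 && plane < 3);
    if (plane == 0) {
        return {frameWidth, frameHeight};
    }
    return {frameWidth / 2, frameHeight / 2};
}

// Size in bytes of one tightly packed plane.
inline std::size_t planeBytes(const PlaneDims& d) {
    return static_cast<std::size_t>(d.width) * d.height;
}
```

Under these assumptions, each of the y, u and v arrays would hold planeBytes(yuv420PlaneDims(w, h, i)) bytes for plane i.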

In summary this is the MMAPI sample workflow:

file_read(cpu) → NvBuffer(cpu) → NvBufSurfaceSyncForDevice (cpu->gpu) → v4l2_buf (gpu)

But my initial data is on GPU memory, so how should I change the above pipeline?
Thank you.

Hi,
You may refer to the sample:

/usr/src/jetson_multimedia_api/samples/03_video_cuda_enc

NvBufSurface can be mapped to the GPU, and you can do a GPU-to-GPU memory copy to put the frame data into NvBufSurface.

We also have a later JetPack release. We would suggest upgrading to the latest, 5.1.3 (r35.5.0).

Thank you, I am checking that sample as well. In that one, too, the initial data is read by the CPU and then mapped to the GPU.
I am wondering how I can load the NvBuffer on the GPU even before mapping through NvBufSurface.
Can I map NvBufSurface to the GPU in advance, and then copy my GPU data to NvBufSurfaceMappedAddr directly (and omit the NvBuffer)?

Hi,
The 03 sample demonstrates how to access NvBufSurface through both GPU and CPU. You may ignore the CPU part and focus on the GPU part.

Thank you, I am trying to get the key point here.
This is the main part of the 03 sample that I am looking at:

struct v4l2_buffer v4l2_buf;
struct v4l2_plane planes[MAX_PLANES];
NvBuffer *buffer = ctx.enc->output_plane.getNthBuffer(i);
memset(&v4l2_buf, 0, sizeof(v4l2_buf));
memset(planes, 0, MAX_PLANES * sizeof(struct v4l2_plane));
v4l2_buf.index = i;
v4l2_buf.m.planes = planes;

read_video_frame(ctx.in_file, *buffer); // This gets the data on CPU?!

ret = sync_buf (buffer); // This is for CPU so I ignore!

ret = render_rect (&ctx, buffer); // This maps the data on EGL

ret = ctx.enc->output_plane.qBuffer(v4l2_buf, NULL);

I understand ignoring sync_buf (cpu) and going to render_rect,
but what should I do with NvBuffer? In a function similar to read_video_frame I have tried:

void cuda_read_video_frame(
    unsigned char* y_arr,
    unsigned char* u_arr,
    unsigned char* v_arr,
    NvBuffer & buffer)
{
    for (uint32_t i = 0; i < buffer.n_planes; i++)
    {
        NvBuffer::NvBufferPlane &plane = buffer.planes[i];
        plane.bytesused = 0;

        // Source arrays are tightly packed (pitch == width); the destination uses the plane stride.
             if(i==0)  cudaMemcpy2D(plane.data, plane.fmt.stride, y_arr, plane.fmt.width, plane.fmt.width, plane.fmt.height, cudaMemcpyDeviceToDevice);
        else if(i==1)  cudaMemcpy2D(plane.data, plane.fmt.stride, u_arr, plane.fmt.width, plane.fmt.width, plane.fmt.height, cudaMemcpyDeviceToDevice);
        else if(i==2)  cudaMemcpy2D(plane.data, plane.fmt.stride, v_arr, plane.fmt.width, plane.fmt.width, plane.fmt.height, cudaMemcpyDeviceToDevice);

        plane.bytesused = plane.fmt.stride * plane.fmt.height;
    }
}

but it doesn’t work.
If I change it to cudaMemcpyDeviceToHost everything works fine! (although very slow, since the DeviceToHost copy takes more time)
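
For reference, the row layout that the cudaMemcpy2D calls above ask for (tightly packed source, pitched destination) can be modelled on the host. This is only a sketch of the addressing, not of the CUDA call itself. copyPitched2D is a hypothetical helper, and the width-versus-pitch check reflects my understanding that the copied row width must not exceed either pitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-only stand-in for the row layout of a 2D pitched copy.
// dpitch / spitch: bytes between the starts of consecutive rows in dst / src.
// widthBytes: bytes actually copied per row; height: number of rows.
inline void copyPitched2D(unsigned char* dst, std::size_t dpitch,
                          const unsigned char* src, std::size_t spitch,
                          std::size_t widthBytes, std::size_t height) {
    // A row cannot be wider than either pitch.
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    for (std::size_t row = 0; row < height; ++row) {
        // Row r starts at r * pitch in each buffer; padding bytes are left untouched.
        std::memcpy(dst + row * dpitch, src + row * spitch, widthBytes);
    }
}
```

With a destination pitch larger than the width, the destination has to be at least dpitch * height bytes, which is the same quantity the snippet above stores in plane.bytesused.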

So it is still not clear to me where I can inject my CUDA data. Should I copy into nvbuf_surf->surfaceList[0].mappedAddr.eglImage? Then how should I fill the NvBuffer planes?

Hi,
Please check HandleEGLImage() in

/usr/src/jetson_multimedia_api/samples/common/algorithm/cuda/NvCudaProc.cpp

You would need to call the CUDA functions to get the pointer, and then you can copy data to the NvBufSurface.

Hi,
Now I am getting some ideas.
It seems that with the code below I can transfer my data to the NvBufSurface eglImage:

for (uint32_t j = 0; j < buffer->n_planes; j++)
{
     // Look up the NvBufSurface behind this plane's fd and map it as an EGLImage
     NvBufSurface *nvbuf_surf = 0;
     NvBufSurfaceFromFd(buffer->planes[j].fd, (void**)(&nvbuf_surf));
     NvBufSurfaceMapEglImage(nvbuf_surf, -1);
     NvBufSurfaceParams *sfparams = &nvbuf_surf->surfaceList[0];
     NvBufSurfacePlaneParams *nvspp = &nvbuf_surf->surfaceList[0].planeParams;

     // Register the EGLImage with CUDA to obtain device pointers for each plane
     CUgraphicsResource pResource;
     cuGraphicsEGLRegisterImage(&pResource, sfparams->mappedAddr.eglImage, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
     CUeglFrame eglFrame;
     cuGraphicsResourceGetMappedEglFrame( &eglFrame, pResource, 0, 0 );

          if(j==0)  cudaMemcpy2D(eglFrame.frame.pPitch[0], nvspp->pitch[0], y_cuda, nvspp->width[0], nvspp->width[0], nvspp->height[0], cudaMemcpyDeviceToDevice);
     else if(j==1)  cudaMemcpy2D(eglFrame.frame.pPitch[1], nvspp->pitch[1], u_cuda, nvspp->width[1], nvspp->width[1], nvspp->height[1], cudaMemcpyDeviceToDevice);
     else if(j==2)  cudaMemcpy2D(eglFrame.frame.pPitch[2], nvspp->pitch[2], v_cuda, nvspp->width[2], nvspp->width[2], nvspp->height[2], cudaMemcpyDeviceToDevice);
}

but what about the NvBuffer* now? Should I do another copy to buffer->planes[j].data?

The encoder thread:

encoder_capture_plane_dq_callback(struct v4l2_buffer *v4l2_buf, NvBuffer * buffer, NvBuffer * shared_buffer, void *arg)

strictly needs the NvBuffer* data, but I still couldn’t find a way to transfer the data to NvBuffer using only one cudaMemcpyDeviceToDevice copy.

Hi,
For feeding NvBufSurface into the output plane, you don’t need to fill in NvBuffer. Please also refer to this patch:
How to use v4l2 to capture videos in Jetson Orin r35 Jetpack 5.0 and encode them using a hardware encoding chip - #8 by DaneLLL

And please check write_video_frame() for getting encoded stream in capture plane.
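
As a rough illustration of the capture-plane side, the sketch below appends one encoded chunk to an output stream. It is not the sample's write_video_frame(). EncodedChunk and appendEncodedChunk are hypothetical names, and the sketch assumes that plane 0 of the dequeued buffer exposes a data pointer and a bytesused count, as the NvBuffer planes in the snippets above do. Treating an empty chunk as nothing to write is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>

// Hypothetical stand-in for plane 0 of a dequeued capture-plane buffer.
struct EncodedChunk {
    const unsigned char* data;  // start of the encoded bytes
    std::size_t bytesused;      // number of valid bytes at data
};

// Appends one encoded chunk to the stream.
// Returns false when the chunk is empty or the write fails.
inline bool appendEncodedChunk(std::ostream& out, const EncodedChunk& chunk) {
    if (chunk.bytesused == 0) {
        return false;  // nothing to write
    }
    assert(chunk.data != nullptr);
    out.write(reinterpret_cast<const char*>(chunk.data),
              static_cast<std::streamsize>(chunk.bytesused));
    return out.good();
}
```

In the callback this would be called once per dequeued buffer, with the stream opened in binary mode so the bitstream is written unchanged.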