NvBufferColorFormat Grey

According to v4l2_nv_extensions.h, NvVideoConverter seems to support V4L2_PIX_FMT_GREY; however, nvbuf_utils.h has no greyscale option.

I want to copy the V4L2_PIX_FMT_GREY NvBuffer into another dmabuf_fd created with NvBufferCreateEx so I can pull out the egl_frame and wrap it with a cuda::GpuMat, as done in the sample by DaneLLL here. Everything works if I use one of the RGB formats, but I plan on passing that GpuMat through remap and into VisionWorks for disparity, so having it in greyscale already would be helpful.

If I’m going to be wrapping it in a GpuMat anyway, should I copy it out into a buffer created with cudaAllocMapped instead? If so, what would be the best way to copy that data out of the NvBuffer?

OK, I created an input and an output buffer with cudaAllocMapped to store the greyscale data and copied the NvBuffer into the input buffer with NvBuffer2Raw. Everything seems to be working.

Is there a better way of getting that data from the NvBuffer to a GpuMat?

void *DisparitySink::process(void *priv)
{
    DisparitySink *ctx = static_cast<DisparitySink*>(priv);
    CUcontext cuda_ctx = 0;
    CUresult status;

    // Allow zero copy access

    try {
        status = cuInit(0);
        if(status != CUDA_SUCCESS)
            throw status;

        CUdevice dev;
        status = cuDeviceGet(&dev, 0);
        if(status != CUDA_SUCCESS)
            throw status;

        status = cuCtxCreate(&cuda_ctx, 0, dev);
        if(status != CUDA_SUCCESS)
            throw status;
    }
    catch (CUresult &status) {
        const char *error;
        cuGetErrorString(status, &error);
        return nullptr;
    }

    cuda::GpuMat mapx;
    cuda::GpuMat mapy;

    Mat K, R, P;
    Vec4d D;

    // Load the calibration for camera 0
    FileStorage storage("1920x1080.yml", FileStorage::READ);
    storage["K0"] >> K;
    storage["D0"] >> D;
    storage["R0"] >> R;
    storage["P0"] >> P;

    // Build the rectification maps on the CPU, then upload them for cuda::remap.
    // The call name is assumed from the argument list (use the fisheye:: variant
    // if D holds fisheye coefficients); the uploads are assumed as well.
    Mat cpu_mapx, cpu_mapy;
    initUndistortRectifyMap(
        K, D, R, P, Size(ctx->m_width, ctx->m_height), CV_32FC1, cpu_mapx, cpu_mapy);
    mapx.upload(cpu_mapx);
    mapy.upload(cpu_mapy);

    void *input_cpu = nullptr;
    void *input_cuda = nullptr;

    cudaAllocMapped(&input_cpu, &input_cuda, ctx->m_width*ctx->m_height);
    cuda::GpuMat cv_in(ctx->m_height, ctx->m_width, CV_8UC1, input_cuda);

    void *output_cpu = nullptr;
    void *output_cuda = nullptr;

    cudaAllocMapped(&output_cpu, &output_cuda, ctx->m_width*ctx->m_height);
    cuda::GpuMat cv_out(ctx->m_height, ctx->m_width, CV_8UC1, output_cuda);

    while(1) {
        struct v4l2_buffer v4l2_buf;
        struct v4l2_plane planes[MAX_PLANES];

        memset(&v4l2_buf, 0, sizeof(v4l2_buf));
        memset(planes, 0, sizeof(planes));

        v4l2_buf.m.planes = planes;

        // Wait for a frame; pthread_cond_wait requires the lock to be held
        pthread_mutex_lock(&ctx->m_capture_lock);
        while(ctx->m_capture_queue->empty()) {
            pthread_cond_wait(&ctx->m_capture_cond, &ctx->m_capture_lock);
        }

        NvBuffer *buffer = ctx->m_capture_queue->front().first;
        struct timeval ts = ctx->m_capture_queue->front().second;
        ctx->m_capture_queue->pop();  // assumes m_capture_queue is a std::queue
        pthread_mutex_unlock(&ctx->m_capture_lock);

        // Stop on an empty buffer (assumed to mark end of stream)
        if(buffer->planes[0].bytesused == 0)
            break;

        v4l2_buf.index = buffer->index;

        // Copy NvBuffer to mapped buffer
        NvBuffer2Raw(
            buffer->planes[0].fd, 0, ctx->m_width, ctx->m_height, static_cast<uint8_t*>(input_cuda));

        cuda::remap(cv_in, cv_out, mapx, mapy, INTER_LINEAR);

        // Re-queue MMAP buffer on capture plane
        if(ctx->m_conv->capture_plane.qBuffer(v4l2_buf, nullptr) < 0) {
            DEBUG_ERROR("failed to queue buffer");
        }

        // Write out image
        if(ctx->m_write_flag) {
            capture_frame(output_cpu, ctx->m_width, ctx->m_height, "cv_out.png");
            ctx->m_write_flag = false;
        }
    }

    status = cuCtxDestroy(cuda_ctx);
    if(status != CUDA_SUCCESS)
        DEBUG_WARN("unable to destroy CUDA context");

    DEBUG_VERBOSE("disparity thread finished");
    return nullptr;
}


It depends on the use case.

There are several memory types that can be used on the Jetson.
For example, pinned memory adds no copy overhead but is slower when the data is accessed frequently.
Unified memory is efficient but is not preferable in cases where the cache is not working.

You can find more information and a comparison here:


That is helpful, thank you.

So since I am just wrapping the memory in a GpuMat I should probably keep it in Device Memory rather than as cudaHostAllocMapped.

Just for my understanding, would an NvBuffer be considered Device Memory, since you have to explicitly map and sync it for use with the CPU?

Switching out cudaAllocMapped for cudaMalloc causes stride errors from NvBuffer2Raw. Should I use cudaMallocPitch instead, or is there something else I’m missing?

NVMAP_IOC_READ failed: Interrupted system call
NVMAP_IOC_READ: Offset 0 SrcStride 2048 pDst 0xfc4980000 DstStride 1920 Count 1080
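
The log above reports a source stride of 2048 bytes against a destination stride of 1920 bytes for 1080 rows, which suggests that each row of the NvBuffer carries padding beyond the 1920 visible pixels. As a rough host-side sketch of what a stride-aware copy looks like (the names copy_pitched and pack_tight are hypothetical and are not part of nvbuf_utils or CUDA), rows can be copied one at a time so the padding is skipped:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy `rows` rows of `width_bytes` bytes each between two buffers whose
// row pitches (strides) may differ. Bytes past `width_bytes` in a source
// row are treated as padding and are not copied.
void copy_pitched(std::uint8_t *dst, std::size_t dst_pitch,
                  const std::uint8_t *src, std::size_t src_pitch,
                  std::size_t width_bytes, std::size_t rows)
{
    assert(width_bytes <= src_pitch && width_bytes <= dst_pitch);
    for (std::size_t y = 0; y < rows; ++y) {
        std::memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
    }
}

// Convenience wrapper: return a tightly packed copy (pitch == width_bytes)
// of a padded single-channel 8-bit image.
std::vector<std::uint8_t> pack_tight(const std::uint8_t *src, std::size_t src_pitch,
                                     std::size_t width_bytes, std::size_t rows)
{
    std::vector<std::uint8_t> dst(width_bytes * rows);
    copy_pitched(dst.data(), width_bytes, src, src_pitch, width_bytes, rows);
    return dst;
}
```

With the values from the log, the call would be pack_tight(src, 2048, 1920, 1080). For device memory, cudaMemcpy2D performs the same kind of pitched copy, and cudaMallocPitch returns a pitch chosen by the runtime, so it should not be assumed to equal 2048. Whether either call resolves the NVMAP_IOC_READ failure is not established here.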


An NvBuffer is DMA memory; it is made accessible to the GPU by mapping it through the EGL interface.