Display bayer CSI camera output without ISP


I’d like to capture, debayer, and display my camera output without using the ISP on the TX2. I have some questions about this.

  1. Am I correct in planning to capture with v4l2, perform RGGB->RGB->YUV420 conversion with NPP, and display with NVDrmRenderer?
  2. Is there an example I could refer to that does this?


hello lancehxh2z,

  1. may I know which camera you’re using, is it a YUV sensor or bayer sensor or something else?
  2. I would like to have more details about your use-case, for example, display live preview camera frames.
  3. you may found some examples from Camera Software Development Solution chapter to access the camera sensor.
  1. 4k bayer sensor
  2. Yes, I would like to display live preview frames at 60fps
  3. I mean an example to display frames from v4l2 without libargus or gstreamer. I have the sensor driver working for argus but need to have access to bayer data

To clarify, if I have this code:

int srcStep, dstStep;
Npp8u *src = nppiMalloc_8u_C1(width, height, &srcStep);
Npp8u *dst = nppiMalloc_8u_C3(width, height, &dstStep);
nppiSet_8u_C1R(255, src, srcStep, {width, height});

NppiSize size = { width, height };
NppiRect roi  = { 0, 0, width, height };
nppiCFAToRGB_8u_C1C3R(src, srcStep, size, roi, dst, dstStep, NPPI_BAYER_RGGB, NPPI_INTER_UNDEFINED);

How do I display the dst buffer?

hello lancehxh2z,

don’t you need ISP to handle demosaic?
if you handle demosaic by your own, you should refer to [L4T Multimedia API Reference], and check the EGL render for sending the buffer to display.

You may also need to adjust the data a bit depending on your image data format. For example, raw12, is packed into bits 0-13 with bits 0 and 1 being copies of the most significant bits of the pixel. See ‘27.10.6 RAW Memory Formats’ of the Parker TRM.


 15  14  13  12  11  10  09  08  07  06  05  04  03  02  01  00
 00  00 D11 D10 D09 D08 D07 D06 D05 D04 D03 D02 D01 D01 D11 D10
1 Like

I’d be interested in a similar example. In my case, how to capture a raw Bayer 12-bit RGGB data from an IMX264, debayer using the NPP (or other means), and continue processing in CUDA (no rendering required).

Both the argus and nvcamerasrc ISPs have poor debayering results at low exposure levels. We can achieve much better results debayering manually.

Currently, capturing raw bayer images through v4l2-ctl works fine and the image can be debayerd to greyscale in a separate process:

v4l2-ctl --device /dev/video0 --set-fmt-video=width=2464,height=2066,pixelformat=RG12 --stream-mmap --stream-to=output1.raw --stream-count=1 --stream-skip=1

Is using the v4l2cuda and camera_v4l2_cuda the best starting point? One uses V4L2_MEMORY_MMAP the other V4L2_MEMORY_DMABUF. Is there a notable difference?

@RS64 If you don’t need to render the image then v4l2cuda is already done for you. Just use -u -z options and debayer in process_image.

@JerryChang I am most interested in a sample for loading an NvBuffer with raw image data and rendering with NvEglRenderer/encoding with NvVideoEncoder.

hello lancehxh2z,

besides debayer process, you may refer to MMAPI examples, 12_camera_v4l2_cuda to capture images via V4L2 and rendering it to display.

Thanks Lance + Jerry,

I see in v4l2cuda in the gpuConvertYUYVtoRGB() call they end up doing a

cudaMemcpy(d_src, src, planeSize * 2, cudaMemcpyHostToDevice);

for copying data from the host CPU to the device and back again. It would be nice not to have to do this. In the 12_camera_v4l2_cuda example they use a DMA buffer:

buf.index = index;
buf.memory = V4L2_MEMORY_DMABUF;

but in this example, the dma fd comes from

NvBufferCreateEx(&fd, &input_params)

where input_params has a NvBufferColorFormat, but there is no enum for 12-bit RG12.

Is there a way to get a raw 12-bit bayer image on to the GPU using DMA for CUDA processing directly? We don’t need to render anything. Maybe we can recycle one of the existing formats? Does the driver require special support for this ability?

1 Like

@RS64 v4l2cuda does zero copy. Just use the -u and -z options. Look in init_userp and you’ll see it using cudaMalloc to allocate the v4l2 reqbufs. So when you DQBUF the buf.m.userptr points to device memory.

Thanks for getting back to me Lance, I appreciate you taking the time to respond!

I had a look at those options. I see

if (cuda_zero_copy) {
    cudaMallocManaged (&buffers[n_buffers].start, buffer_size, cudaMemAttachGlobal);

From what I understand, cudaMalloc() and cudaMallocManged() are very different things. The former is true device memory. The latter allocates device memory, given host memory, and auto-manages the data transfer between the device/host depending on which one is requesting access to the data. I would hazard to guess that the zero-copy being referred to is between the camera and main memory and not the camera and GPU memory.

With cudaMallocManged() and no prefetch call, I would expect the kernel launch to block until the data transfer is complete.

When I hear ‘zero-copy’ I assumed camera -> GPU memory, until I saw cudaMemcpy(…, cudaMemcpyHostToDevice) and cudaMallocManged() used.

Given we will debayer in the GPU, it seems inefficient to have a 2 step data transfer.

If I misunderstand this, by all means, I’m more than open to learning more about it :)

Thanks Jerry, We do not need to render, but can we use the same 12_camera_v4l2_cuda approach to copy a 12-bit RGGB direct to the GPU given there is no 12-bit format supported? according to:

static nv_color_fmt nvcolor_fmt[] =
    // TODO add more pixel format mapping
    {V4L2_PIX_FMT_UYVY, NvBufferColorFormat_UYVY},
    {V4L2_PIX_FMT_VYUY, NvBufferColorFormat_VYUY},
    {V4L2_PIX_FMT_YUYV, NvBufferColorFormat_YUYV},
    {V4L2_PIX_FMT_YVYU, NvBufferColorFormat_YVYU},
    {V4L2_PIX_FMT_YUV420M, NvBufferColorFormat_YUV420},

hello RS64,

the example, 12_camera_v4l2_cuda is for YUV sensor or USB-camera, it may not support 12-bit RGGB color format.
could you please refer to ~/multimedia_api/ll_samples/samples/v4l2cuda/ instead.

Hi D3_growe,
I am porting 12 bit monochrome image sensor driver from Jetson TX1 to TX2. I am able to stream and store frames using v4l2-ctl and gst-launch-1.0 commands.
v4l2-ctl --device /dev/video0 --set-fmt-video=width=1280,height=804,pixelformat=RG12 --stream-mmap --stream-to=output1.raw --stream-count=10 --stream-skip=1
The raw image sensor data is configured to produce bayer 12 bit format, here is the mode part of dts file.
mode0 {
active_w = “1280”;
active_h = “804”;
mode_type = “bayer”;
pixel_phase = “grbg”;
pixel_t = “bayer_grbg12”;
dynamic_pixel_bit_depth = “12”;
csi_pixel_bit_depth = “12”;
readout_orientation = “0”;
line_length = “2560”;

For TX1, similar setting produces 12 bit data in raw frame, but for TX2 it produces 14 bit per pixel. I can convert 14 bit data to 12 bits per pixel in application, but its resource intensive. Is there any way I can get the 12 bit raw data directly from image sensor same as TX1 ?

I do not know of a way, off hand, to turn off the pixel packing in Tx2 and Xavier (which differ from each other too). I suspect the answer lies in the TRM but I do not know the answer. I’m interested in knowing if this is easy or possible too.

Thanks a lot @D3_growe, I will check TRM to see if I can do anything, also glad to know I am not the only one with this problem. Last query was my first post in the nvidia forum, should I be creating a new topic on forum to get attention of developers who might have solved this issue already or every one gets notification that we are discussing the issue in this thread ?

It depends on the users settings but the default is to get email notifications if you’ve participated on a thread.

Is the processing overhead that bad? If you’re using CUDA this should be highly parallel. You’d just shift right for each pixel. Still, it would be nice if you could simply modify a register and get the data already in the format you need.

Good luck!

Hi @D3_growe,
Thanks for your quick reply and suggestions, really appreciate it.
For 60 fps and running few other filters while frame acquisition is dropping approximately 1% frames (over 30 mins). Our pipeline is not optimized to use single buffer across several components of the gstreamer pipeline so we have few memory copies going on. I have tried to replace one of the memcpy call with a new copy function that does 14 --> 12 bit conversion by right shifting 2 bits. Yes, the next thing I was about to try was cuda to perform the bit conversion.