Handling memory alignment in NvBufSurface vs CUDA

I’m currently working on a GStreamer pipeline where I use CUDA for format conversion from Gray16 (16-bit grayscale) to NV24. My input image is allocated using cudaMallocPitch(), and the output is an NvBufSurface allocated with NVBUF_MEM_SURFACE_ARRAY, which stores NV24 formatted data.
I’m facing memory alignment mismatches between the source (Gray16 allocated with cudaMallocPitch) and the destination (NvBufSurface in NV24 format), which is causing issues in my CUDA kernel. The output shows color artifacts such as a green tint, especially on alternate frames, and sometimes outright frame corruption.
The pitch returned by cudaMallocPitch() for Gray16 is typically greater than the NvBufSurface pitch.
What is the recommended way to handle pitch mismatches between a CUDA-allocated buffer and a GStreamer NvBufSurface?

Since the input and output are two separate buffers, it’s not clear to me why a pitch difference matters. I guess if you are taking an output in a pitched buffer of pitch X, and want to use it in a pitched buffer of pitch Y (say, as input), then you would need to do a pitch conversion (i.e. buffer-to-buffer copying).
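As a rough illustration of the buffer-to-buffer copy mentioned above, here is a minimal sketch built on cudaMemcpy2D, which takes separate source and destination pitches. The helper name repitch_device_buffer is hypothetical, and the sketch assumes both pointers are ordinary CUDA device pointers and that all sizes are in bytes.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: copy `rows` rows of `row_bytes` payload each from a
// buffer with pitch `src_pitch` into a buffer with pitch `dst_pitch`.
// All sizes are in bytes; both pointers are assumed to be device pointers.
cudaError_t repitch_device_buffer(void* dst, size_t dst_pitch,
                                  const void* src, size_t src_pitch,
                                  size_t row_bytes, size_t rows)
{
    // The payload of a row must fit inside both pitches.
    if (row_bytes > dst_pitch || row_bytes > src_pitch)
        return cudaErrorInvalidValue;

    // cudaMemcpy2D skips the padding at the end of each row on both sides.
    return cudaMemcpy2D(dst, dst_pitch, src, src_pitch,
                        row_bytes, rows, cudaMemcpyDeviceToDevice);
}
```

For a 1920-pixel-wide Gray16 image, row_bytes would be 1920 * 2 = 3840, regardless of the pitch either buffer happens to use.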

Alternatively, skip the use of cudaMallocPitch altogether. It’s a relic of CUDA that I don’t think has much useful purpose anymore. For APIs that take a pitch, simply specify a pitch equal to the row width in bytes.

Alternatively, skip the use of cudaMallocPitch, and allocate your CUDA buffers to match whatever other pitch value you think is important. This buffer then is still a “pitched” buffer, where the pitch value is of your choosing. It’s unlikely (IMO) to cause trouble or performance concerns, if used correctly.
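A minimal sketch of that idea, assuming a plain cudaMalloc allocation with a caller-chosen pitch. The PitchedBuffer struct and the alloc_with_pitch helper are hypothetical names introduced only for illustration. Passing pitch equal to row_bytes gives the tightly packed "pitch = width" case from the previous suggestion.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical container: a device pointer plus the row pitch in bytes.
struct PitchedBuffer {
    void*  data;
    size_t pitch;
};

// Allocate `rows` rows with a caller-chosen pitch (in bytes).
// `row_bytes` is the payload per row; `elem_size` is the size of one element
// (2 for Gray16), used to keep the start of every row naturally aligned.
cudaError_t alloc_with_pitch(PitchedBuffer* out, size_t row_bytes, size_t rows,
                             size_t pitch, size_t elem_size)
{
    if (out == nullptr || elem_size == 0 ||
        pitch < row_bytes || pitch % elem_size != 0)
        return cudaErrorInvalidValue;

    out->data  = nullptr;
    out->pitch = pitch;
    // One linear allocation; row r starts at byte offset r * pitch.
    return cudaMalloc(&out->data, pitch * rows);
}
```

Any kernel or copy that touches such a buffer still has to step from row to row using this pitch rather than the image width.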

Yes, the output is a pitched buffer (NVBUF_LAYOUT_PITCH). My cudaMallocPitch() call gives a pitch of 4096, whereas the NvBufSurface with NV24 has a pitch of only 2048. In my CUDA kernel function:

if (x >= width || y >= height) return;

int index = y * width + x;

// Convert 16-bit grayscale to 8-bit Y
y_plane[index] = gray16[index] >> 8;

// Set full-resolution UV (interleaved)
int uv_index = index * 2;

uv_plane[uv_index + 0] = 128; // U
uv_plane[uv_index + 1] = 128; // V

This works for single-image conversion. When I tested the same code with a video file, the first frame is correct and the remaining frames are a green screen (CUDA kernel error: an illegal memory access was encountered).
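The kernel above computes every index as y * width + x, so it ignores the pitch of all three buffers. A pitch-aware variant might look like the sketch below. It is only a sketch under stated assumptions: the kernel and launcher names are hypothetical, all pitches are in bytes, the Y and UV pitches are assumed to be read from the destination surface's per-plane parameters (they may differ from each other), and all three pointers are assumed to be valid CUDA device pointers. How the NvBufSurface planes are mapped for CUDA access is not covered here.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical pitch-aware Gray16 -> NV24 kernel. Each row is located by its
// own pitch in bytes, never by the image width.
__global__ void gray16_to_nv24(const uint8_t* src, size_t src_pitch,
                               uint8_t* y_plane, size_t y_pitch,
                               uint8_t* uv_plane, size_t uv_pitch,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Step to the start of row y in each buffer using that buffer's pitch.
    const uint16_t* src_row =
        reinterpret_cast<const uint16_t*>(src + y * src_pitch);
    uint8_t* y_row  = y_plane  + y * y_pitch;
    uint8_t* uv_row = uv_plane + y * uv_pitch;

    // Convert 16-bit grayscale to 8-bit Y.
    y_row[x] = static_cast<uint8_t>(src_row[x] >> 8);

    // Full-resolution interleaved UV: neutral chroma for a gray image.
    uv_row[2 * x + 0] = 128; // U
    uv_row[2 * x + 1] = 128; // V
}

// Hypothetical launcher: returns launch errors instead of ignoring them.
cudaError_t launch_gray16_to_nv24(const uint8_t* src, size_t src_pitch,
                                  uint8_t* y_plane, size_t y_pitch,
                                  uint8_t* uv_plane, size_t uv_pitch,
                                  int width, int height, cudaStream_t stream)
{
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    gray16_to_nv24<<<grid, block, 0, stream>>>(src, src_pitch,
                                               y_plane, y_pitch,
                                               uv_plane, uv_pitch,
                                               width, height);
    return cudaGetLastError();
}
```

With this layout the source pitch of 4096 and a Y-plane pitch of 2048 can coexist, because neither value is used to index the other buffer. Checking the result of cudaStreamSynchronize after the launch on each frame should also help narrow down whether the illegal access comes from indexing or from a pointer that is no longer valid for that frame.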