MMAPI samples: v4l2cuda buffer size question for zero copy

We are implementing a camera grabber that writes to unified memory, using the MMAPI samples as a reference. In line 447 of v4l2cuda/capture.cpp, the buffer size is rounded up from the actual image size to the next multiple of the page size:

buffer_size = (buffer_size + page_size - 1) & ~(page_size - 1);

What is the reason for that?
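The expression relies on the page size being a power of two. Adding page_size - 1 pushes any partial page over into the next page, and masking with ~(page_size - 1) clears the low bits so the result is a whole number of pages. Below is a minimal sketch of the same arithmetic. The helper names round_up_pow2 and page_rounded_size are hypothetical, and it assumes a POSIX system where sysconf(_SC_PAGESIZE) reports the page size (the sample itself may obtain it differently, e.g. with getpagesize()).

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Round n up to the next multiple of align.
// The bitmask trick is only valid when align is a power of two,
// which holds for the system page size.
inline std::size_t round_up_pow2(std::size_t n, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);  // power-of-two check
    return (n + align - 1) & ~(align - 1);
}

// Same computation as the sample's line, using the runtime page size.
// Hypothetical helper name; assumes sysconf(_SC_PAGESIZE) succeeds.
inline std::size_t page_rounded_size(std::size_t buffer_size) {
    const std::size_t page_size =
        static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    return round_up_pow2(buffer_size, page_size);
}
```

For example, a 1920x1080 YUYV frame is 1920 * 1080 * 2 = 4147200 bytes, which is 1012.5 pages of 4096 bytes, so it is rounded up to 1013 pages (4149248 bytes). A likely motivation, though the sample does not state it, is that V4L2 user-pointer buffers are pinned and mapped by the driver in whole pages, so a page-multiple length means a page-aligned buffer ends exactly on a page boundary.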

On a side note, I am also wondering whether it is safe for the Linux kernel (i.e. the V4L2 driver) to access the globally attached buffer while CUDA kernels are running. If I am not mistaken, this would lead to bus errors in user space.

We think performance may be better when the buffer size is aligned to the page size. It should also work if you remove the alignment logic. Please give it a try.
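The two options in the reply differ only in the buffer length; in both cases the start address can stay page-aligned. A minimal sketch contrasting them, assuming a plain POSIX heap allocation with posix_memalign so it runs without a GPU. The sample's zero-copy path allocates CUDA managed memory instead, which this sketch does not model, and alloc_capture_buffer with its page_align_size flag is a hypothetical helper.

```cpp
#include <cstddef>
#include <stdlib.h>   // posix_memalign, free
#include <unistd.h>   // sysconf

// Hypothetical helper: allocate one capture buffer of at least frame_size bytes.
//  page_align_size == true  -> length rounded up to whole pages (the sample's behaviour)
//  page_align_size == false -> length is exactly frame_size (alignment logic removed)
// The start address is page-aligned in both modes. Returns nullptr on failure.
// The caller releases the buffer with free().
void *alloc_capture_buffer(std::size_t frame_size, bool page_align_size,
                           std::size_t *out_len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len =
        page_align_size ? (frame_size + page - 1) & ~(page - 1) : frame_size;
    void *p = nullptr;
    if (posix_memalign(&p, page, len) != 0) {
        return nullptr;
    }
    *out_len = len;
    return p;
}
```

Only the tail of the buffer differs between the two modes: with rounding, the last page belongs entirely to the capture buffer. The sketch makes no claim about which mode is faster.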

Hey, good to know it should work without explicit alignment. Could you elaborate on which performance you mean? The performance of the CUDA kernel, of the V4L2 internals, or of the unified memory subsystem?