Best Practices for memory reuse

I’m currently implementing a video streaming pipeline and am curious about the recommended best practices for organizing shared allocations.

Use case

I am creating a zero-copy, single-producer, multi-consumer frame management tool. Frames are produced, and the resulting GPU memory is distributed to multiple downstream consumers. Each consumer can then read from this same buffer.

Current Structure

I have created host-level shared memory regions that contain frame metadata and locks to ensure that frames are not overwritten while being read. These headers are stored in a ring buffer of relatively small size (one second's worth of frames, 25 frames for now).
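Roughly, each ring slot looks something like this (field names and layout are simplified for illustration, not my exact code):

```cpp
// Hypothetical per-frame header stored in the host shared-memory ring buffer.
// Atomics must be lock-free to be usable across processes in shared memory.
#include <atomic>
#include <cstdint>

constexpr int kNumFrames = 25;   // one second of frames at 25 fps

struct FrameHeader {
    uint64_t frame_id;                  // monotonically increasing sequence number
    uint64_t timestamp_ns;              // capture timestamp
    uint64_t gpu_offset;                // offset of this frame within the shared GPU buffer
    std::atomic<uint32_t> reader_count; // consumers currently reading this slot
    std::atomic<uint32_t> state;        // e.g. FREE / WRITING / READY
};

struct FrameRing {
    std::atomic<uint64_t> write_index;  // next slot the producer will fill
    FrameHeader headers[kNumFrames];    // fixed-size ring, one header per frame slot
};
```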

NVIDIA Orin devices are the target architecture, but it is expected to function on standard discrete GPUs to aid in training.

I am deciding how to implement the shared memory on the device side. I can think of three approaches, but I am unsure which is considered best practice.

Single Allocation

Allocate the entire frame buffer with a single cuMemCreate call of size frame_size * n_frames. Export one shareable handle from that allocation, distribute it, and have each client import and map it. That single segment would hold all shared frame data.
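A rough sketch of what I mean, using the CUDA virtual memory management API (assuming a POSIX file-descriptor handle type and a current CUDA context, with error checking omitted; I have not verified which handle types are supported on Jetson):

```cpp
// Sketch only: single physical allocation covering all frame slots,
// exported once as a shareable handle for the consumer processes.
#include <cuda.h>
#include <cstddef>

CUdeviceptr alloc_shared_frame_buffer(int device, size_t frame_size, size_t n_frames,
                                      int *out_fd /* shareable handle for consumers */) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Round the total size up to the allocation granularity.
    size_t total = ((frame_size * n_frames + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, total, &prop, 0);

    // One shareable handle for the whole ring; consumers import it once.
    cuMemExportToShareableHandle(out_fd, handle, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);

    // Map the allocation into this process's address space and enable access.
    CUdeviceptr base;
    cuMemAddressReserve(&base, total, 0, 0, 0);
    cuMemMap(base, total, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base, total, &access, 1);

    return base;   // frame i lives at base + i * frame_size
}
```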

One-time multi allocation

Allocate the ring buffer up front, but do a separate allocation for each frame object. Explicitly, this means n_frames allocations, each of frame_size. They would still be mapped so that they end up adjacent in the virtual address space, but they would have distinct underlying base allocations, which may or may not improve performance; I am uncertain.
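Sketched out, with the same handle-type and error-checking caveats as above:

```cpp
// Sketch only: one physical allocation per frame, all mapped back-to-back
// into a single reserved VA range so indexing stays contiguous.
#include <cuda.h>
#include <vector>

CUdeviceptr alloc_per_frame_adjacent(int device, size_t frame_size, size_t n_frames,
                                     std::vector<CUmemGenericAllocationHandle> *handles) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Each per-frame allocation must be a multiple of the granularity.
    size_t slot = ((frame_size + gran - 1) / gran) * gran;
    size_t total = slot * n_frames;

    CUdeviceptr base;
    cuMemAddressReserve(&base, total, 0, 0, 0);

    for (size_t i = 0; i < n_frames; ++i) {
        CUmemGenericAllocationHandle h;
        cuMemCreate(&h, slot, &prop, 0);
        cuMemMap(base + i * slot, slot, 0, h, 0);   // distinct backing, adjacent VA
        handles->push_back(h);
    }

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base, total, &access, 1);

    return base;
}
```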

Single Allocation for each frame

Create a new allocation for each frame, and distribute the shared handle for each frame as it is processed. This would add extra allocation latency, but once a frame has been written I would be able to mark its mapping read-only (madvise-style), which could possibly improve performance.
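On the consumer side, this is roughly what I imagine, with cuMemSetAccess standing in for the madvise idea (frame_size assumed already rounded to the allocation granularity; handle type again an assumption):

```cpp
// Sketch only: a consumer imports one frame's shareable handle and maps it
// read-only in its own address space; the producer keeps its read/write mapping.
#include <cuda.h>
#include <cstdint>

CUdeviceptr map_frame_readonly(int device, int fd, size_t frame_size) {
    CUmemGenericAllocationHandle handle;
    cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd,
                                   CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, frame_size, 0, 0, 0);
    cuMemMap(ptr, frame_size, 0, handle, 0);

    // Grant this device read-only access to the imported allocation.
    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = device;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READ;
    cuMemSetAccess(ptr, frame_size, &access, 1);

    return ptr;
}
```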

I have never heard of the Orin platform. Check the forums for embedded platforms to see whether a dedicated subforum exists. You are likely to receive better and faster answers there.

In programming in general, and in CUDA programming in particular, memory allocation / deallocation tends to be expensive and should be minimized when performance is important. A classical technique used in many embedded applications is to allocate all memory needed by an application at startup and re-use this memory indefinitely, e.g. in the form of a memory pool.
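As an illustrative sketch only (names and API choices are mine, not specific to your pipeline), a fixed pool of frame-sized slots allocated once at startup might look like this:

```cpp
// Minimal sketch of the "allocate once, reuse forever" idea: one up-front
// device allocation carved into fixed-size slots handed out by a free list.
// Assumes a current CUDA context; error checking omitted.
#include <cuda.h>
#include <vector>

class FramePool {
public:
    FramePool(size_t slot_size, size_t n_slots) {
        cuMemAlloc(&base_, slot_size * n_slots);        // single allocation at startup
        for (size_t i = 0; i < n_slots; ++i)
            free_.push_back(base_ + i * slot_size);
    }
    CUdeviceptr acquire() {                             // no cuMemAlloc in steady state;
        CUdeviceptr p = free_.back();                   // caller must ensure pool not empty
        free_.pop_back();
        return p;
    }
    void release(CUdeviceptr p) { free_.push_back(p); } // slot goes back to the pool
    ~FramePool() { cuMemFree(base_); }
private:
    CUdeviceptr base_;
    std::vector<CUdeviceptr> free_;
};
```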

[Later:] The sub-forum for the Orin platform is here: Jetson AGX Orin - NVIDIA Developer Forums