I’m currently implementing a video streaming pipeline and am curious about the recommended best practices for organizing shared allocations.
Use case
I am creating a zero-copy, single-producer, multi-consumer frame management tool. Frames are produced, and the resulting GPU memory is distributed to multiple downstream consumers. Each consumer can then read from that same buffer.
Current Structure
I have created host-level shared memory regions that contain things like frame metadata and locks to verify that frames are not overwritten while being read. These headers are stored in a relatively small ring buffer (one second’s worth of frames; 25 frames for now).
NVIDIA Orin devices are the target architecture, but it is expected to function on standard discrete GPUs to aid in training.
I am deciding how to implement the shared memory on the device side. I can think of three approaches, but I am unsure which is considered best practice.
Single Allocation
Allocate the entire frame buffer with a single cuMemCreate call of size frame_size * n_frames, export it once with cuMemExportToShareableHandle, and have each client import that one shareable handle. That single segment would hold all shared frame data.
One-time multi-allocation
Allocate the ring buffer up front, but make a separate allocation for each frame object. Explicitly, this means n_frames allocations of frame_size each. They would still be mapped to adjacent virtual addresses, but each frame would have a distinct underlying base allocation, which may improve performance; I am uncertain.
Per-frame allocation on demand
Create a new allocation for each frame, and distribute the shared handle for each frame as it is produced. This would add allocation latency on the hot path, but once a frame was written I could mark it read-only for consumers (mprotect on the host-side headers, or cuMemSetAccess with CU_MEM_ACCESS_FLAGS_PROT_READ on the device mapping), which could possibly improve performance.