Need a better buffer management pool for improving performance

hardpaul · June 5, 2017, 4:29am

Hi,

I have a CUDA program that uses the OpenCV 3.2 library. It is built with Visual Studio 2015 and runs on a GTX 1080.
The program allocates and releases GpuMat memory many times.

Frequent allocation and release has a big impact on performance, so I tried the two memory pool approaches below, but neither works well:

(1) Allocate all needed buffers before any kernel function runs. Because my program is complicated, that means about 800 buffers in total. Since I use 8 streams for computing, the total size is the buffer size * 8, and these allocations need about 4 GB of device memory. Obviously, this is ugly and impossible to run on customers' computers.

(2) Because the first method is not OK, I used the CUDA BufferPool from OpenCV. That pool is implemented as a stack: if you get four buffers a, b, c, d from it, you must release them in the reverse order d, c, b, a. This simple stack pool is very difficult to use, and I have already hit many memory confusion bugs with it.
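
To make the LIFO constraint of a stack-style pool concrete, here is a minimal host-only sketch. It is not OpenCV's actual BufferPool implementation: `StackPool`, `push`, and `pop` are hypothetical names, and a plain `std::vector` stands in for device memory. It only shows why releasing out of order cannot work with a single "top" offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stack-style pool (illustration only, not the OpenCV class).
// All buffers are carved from one block by moving a single "top" offset.
class StackPool {
public:
    explicit StackPool(std::size_t capacity) : storage_(capacity), top_(0) {}

    // Hand out the next `bytes` bytes above the current top, or nullptr if full.
    void* push(std::size_t bytes) {
        assert(bytes > 0);
        if (top_ + bytes > storage_.size()) return nullptr;
        void* p = storage_.data() + top_;
        sizes_.push_back(bytes);
        top_ += bytes;
        return p;
    }

    // Release a buffer. Only the most recently allocated live buffer can be
    // released; anything else is rejected (returns false).
    bool pop(void* p) {
        if (sizes_.empty()) return false;
        std::size_t start = top_ - sizes_.back();
        if (p != storage_.data() + start) return false;  // not the top buffer
        top_ = start;
        sizes_.pop_back();
        return true;
    }

private:
    std::vector<unsigned char> storage_;  // stand-in for one device allocation
    std::vector<std::size_t> sizes_;      // sizes of live buffers, oldest first
    std::size_t top_;                     // offset of the first free byte
};
```

With buffers a, b, c, d obtained in that order, `pop(a)` is rejected while d is still live; only the sequence d, c, b, a succeeds. A pool that silently accepted the wrong order instead of rejecting it would hand out overlapping memory.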

Could somebody suggest a better buffer management approach for improving performance?
Thank you very much!

njuffa · June 5, 2017, 4:45am

I think it is difficult to make recommendations without knowledge of your software design.

(1) This may come across as harsh, but if the app really requires 800 different buffers, this sounds like a poor design. Re-think the design from the top, with efficient memory usage in mind.

(2) Avoid frequent allocation and de-allocation. Instead try to re-use buffers as often as possible. Usually it helps to use as few distinct buffer sizes as possible. Use simple means (e.g. reference counters) to determine which previously allocated buffers are available for re-use. If no buffer is available for re-use create a fresh allocation. This will at least cut down on the number of allocations.
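
The re-use scheme in (2) could be sketched roughly as follows. This is a host-only illustration under stated assumptions: `ReusePool` and its members are hypothetical names, `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree`, and `std::shared_ptr` supplies the reference counter. It is not thread-safe, and the pool must outlive every buffer it hands out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>
#include <new>
#include <vector>

// Hypothetical re-use pool: freed blocks are cached per size and handed out
// again; a fresh allocation happens only when no cached block fits.
class ReusePool {
public:
    ReusePool() = default;
    ReusePool(const ReusePool&) = delete;
    ReusePool& operator=(const ReusePool&) = delete;

    // Free whatever is still cached. In a CUDA build this would be cudaFree.
    ~ReusePool() {
        for (auto& kv : free_)
            for (void* p : kv.second) std::free(p);
    }

    // Returns a reference-counted buffer. When the last shared_ptr copy is
    // destroyed, the block goes back to the pool instead of being freed.
    std::shared_ptr<void> acquire(std::size_t bytes) {
        assert(bytes > 0);
        void* p = nullptr;
        std::vector<void*>& bucket = free_[bytes];
        if (!bucket.empty()) {
            p = bucket.back();  // re-use a previously allocated block
            bucket.pop_back();
        } else {
            p = std::malloc(bytes);  // stand-in for cudaMalloc
            if (!p) throw std::bad_alloc();
            ++fresh_allocations_;
        }
        return std::shared_ptr<void>(
            p, [this, bytes](void* q) { free_[bytes].push_back(q); });
    }

    std::size_t fresh_allocations() const { return fresh_allocations_; }

private:
    std::map<std::size_t, std::vector<void*>> free_;  // size -> cached blocks
    std::size_t fresh_allocations_ = 0;
};
```

Unlike a stack pool, buffers can be released in any order. Keying the cache by exact size is why few distinct buffer sizes help: the fewer the sizes, the more often a cached block fits. With multiple streams, a block should only be recycled once the GPU work using it has completed.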

(3) Beyond that make use of the copious literature on managing memory. For example, some applications use memory pools, others slab allocators.
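
As one example of the techniques mentioned in (3), a very small slab-style allocator might look like the sketch below. It is an illustration under assumptions, not a production design: `Slab` and its members are hypothetical names, and `std::malloc` stands in for a single large `cudaMalloc`. One backing allocation is split into equal-sized slots, and a free list tracks which slots are available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical fixed-size slab: one backing allocation, many equal slots.
class Slab {
public:
    Slab(std::size_t slot_bytes, std::size_t slot_count)
        : slot_bytes_(slot_bytes),
          base_(static_cast<unsigned char*>(
              std::malloc(slot_bytes * slot_count))) {  // stand-in for cudaMalloc
        if (!base_) throw std::bad_alloc();
        free_slots_.reserve(slot_count);
        // Push in reverse so the lowest address is handed out first.
        for (std::size_t i = slot_count; i > 0; --i)
            free_slots_.push_back(base_ + (i - 1) * slot_bytes);
    }
    Slab(const Slab&) = delete;
    Slab& operator=(const Slab&) = delete;
    ~Slab() { std::free(base_); }

    // Take any free slot; returns nullptr when the slab is exhausted.
    void* allocate() {
        if (free_slots_.empty()) return nullptr;
        void* p = free_slots_.back();
        free_slots_.pop_back();
        return p;
    }

    // Return a slot. Slots may be released in any order.
    void release(void* p) {
        assert(p != nullptr);
        free_slots_.push_back(p);
    }

    std::size_t available() const { return free_slots_.size(); }
    std::size_t slot_bytes() const { return slot_bytes_; }

private:
    std::size_t slot_bytes_;
    unsigned char* base_;             // start of the backing allocation
    std::vector<void*> free_slots_;   // free list of slot addresses
};
```

Allocation and release here only touch the host-side free list, so neither calls into the driver after construction. The cost is that every slot has the same size, so an application would typically keep one slab per buffer size.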

HannesF99 · June 6, 2017, 1:39pm

Which driver version do you use? With recent driver versions, the run time of memory allocation (especially for big buffers) has unfortunately worsened significantly; see https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-_significantly_-slower-on-windows-with-geforce-drivers-gt-350-12/
