Need a better buffer management pool for improving performance

Hi,

I have a CUDA program that uses the OpenCV 3.2 library. It is built with Visual Studio 2015 and runs on a GTX 1080.
The program allocates and releases GpuMat memory many times.

Frequent memory allocation and release has a big impact on performance, so to improve it I tried the two memory pool approaches below, but neither works well:

  1. Allocate all needed buffers before the kernel functions execute. Because my program is complicated, I need to allocate 800 buffers in total up front. Since I use 8 streams for computing, the total size becomes buffersize * 8, and these allocations need about 4 GB of device memory. Obviously, this is ugly and not feasible on customers' computers.
  2. Because the above method is not OK, I used the CUDA BufferPool from OpenCV. But this buffer pool is implemented as a stack; that is, if you get four buffers a, b, c, d from the pool, you must make sure they are released in the order d, c, b, a. This simple stack-based pool is very difficult to use, and I have already hit many memory confusion bugs with it.
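
To illustrate the ordering constraint in item 2, here is a minimal host-only sketch of a stack allocator. It is not OpenCV's actual implementation: the class name StackPool and its methods are hypothetical, and a std::vector stands in for device memory. It only shows why a buffer other than the most recently allocated one cannot be released.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical stack (LIFO) allocator over one preallocated block.
// A host-side std::vector stands in for device memory.
class StackPool {
public:
    explicit StackPool(std::size_t capacity) : storage_(capacity), top_(0) {}

    // Hand out the next `bytes` bytes and move the top of the stack up.
    unsigned char* allocate(std::size_t bytes) {
        if (top_ + bytes > storage_.size()) throw std::bad_alloc();
        unsigned char* p = storage_.data() + top_;
        top_ += bytes;
        return p;
    }

    // Only the most recent allocation sits directly below the top,
    // so only that one can be released.
    void release(unsigned char* p, std::size_t bytes) {
        if (bytes > top_ || p != storage_.data() + (top_ - bytes))
            throw std::logic_error("buffers must be released in reverse order");
        top_ -= bytes;
    }

    std::size_t used() const { return top_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t top_;
};
```

In this sketch an out-of-order release is rejected with an exception; a real stack-based pool may not check, in which case the mistake shows up later as corrupted buffers.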

So could somebody suggest a better buffer management method to improve performance?
Thank you very much!

I think it is difficult to make recommendations without knowledge of your software design.

(1) This may come across as harsh, but if the app really requires 800 different buffers, this sounds like a poor design. Re-think the design from the top, with efficient memory usage in mind.

(2) Avoid frequent allocation and de-allocation. Instead, try to re-use buffers as often as possible. Usually it helps to use as few distinct buffer sizes as possible. Use simple means (e.g. reference counters) to determine which previously allocated buffers are available for re-use. If no buffer is available for re-use, create a fresh allocation. This will at least cut down on the number of allocations.
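
A minimal sketch of the idea in (2), under stated assumptions: the class ReusePool and its methods are hypothetical names, std::malloc/std::free stand in for cudaMalloc/cudaFree so the example runs on the host, and sizes are rounded up to powers of two to keep the number of distinct buffer sizes small. Reference counting is delegated to std::shared_ptr, whose custom deleter returns the buffer to the pool instead of freeing it.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>
#include <new>
#include <vector>

// Hypothetical size-bucketed pool that re-uses released buffers.
// std::malloc/std::free are stand-ins for cudaMalloc/cudaFree.
class ReusePool {
public:
    ReusePool() : state_(std::make_shared<State>()) {}

    // Returns a reference-counted buffer of at least `bytes` bytes.
    // When the last copy of the shared_ptr is destroyed, the buffer
    // goes back onto the free list of its bucket.
    std::shared_ptr<void> acquire(std::size_t bytes) {
        const std::size_t bucket = roundUp(bytes);
        void* p = nullptr;
        std::vector<void*>& freeList = state_->free[bucket];
        if (!freeList.empty()) {
            p = freeList.back();          // re-use a released buffer
            freeList.pop_back();
        } else {
            p = std::malloc(bucket);      // nothing to re-use: fresh allocation
            if (!p) throw std::bad_alloc();
            ++state_->allocations;
        }
        std::weak_ptr<State> weak = state_;
        return std::shared_ptr<void>(p, [weak, bucket](void* q) {
            if (auto s = weak.lock()) s->free[bucket].push_back(q);
            else std::free(q);            // pool already destroyed
        });
    }

    // Number of fresh allocations made so far.
    std::size_t allocationCount() const { return state_->allocations; }

private:
    struct State {
        std::map<std::size_t, std::vector<void*>> free;
        std::size_t allocations = 0;
        ~State() {
            for (auto& kv : free)
                for (void* p : kv.second) std::free(p);
        }
    };

    // Round up to a power of two (minimum 256) to limit distinct sizes.
    static std::size_t roundUp(std::size_t n) {
        std::size_t b = 256;
        while (b < n) b *= 2;
        return b;
    }

    std::shared_ptr<State> state_;
};
```

Unlike a stack-based pool, buffers can be released in any order. The sketch is not thread-safe, and with real device memory a buffer must not be handed out again while work queued on a stream may still be using it, so a production version would need locking and stream synchronization on top of this.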

(3) Beyond that, make use of the copious literature on managing memory. For example, some applications use memory pools, others use slab allocators.

Which driver version do you use? With recent driver versions, memory allocation (especially for big buffers) has unfortunately become significantly slower; see https://devtalk.nvidia.com/default/topic/963440/cudamalloc-pitch-significantly-slower-on-windows-with-geforce-drivers-gt-350-12/