I have a CUDA program that uses the OpenCV 3.2 library. It is built with Visual Studio 2015 and runs on a GTX 1080.
The program allocates and releases many GpuMat buffers.
Allocating and releasing device memory this often has a large impact on performance. To improve it, I tried the following two approaches to memory-pool management, but neither works well:
- Allocate every buffer I need before any kernel function executes. Because my program is complicated, this means 800 buffers in total. Since I use 8 streams for computation, the total becomes (buffer size) × 8, which needs about 4 GB of device memory. This is clearly ugly, and it is not feasible on my customers' computers.
- Because the first approach does not work, I switched to OpenCV's CUDA BufferPool. However, this buffer pool is implemented as a stack: if you get four buffers a, b, c, d from the pool, you must release them in the order d, c, b, a. This simple stack-based pool is very hard to use correctly, and I have already run into many memory-corruption bugs since adopting it.
Could anyone suggest a better approach to memory buffer management that would improve performance?
Thank you very much!