Using CUBLAS in a shared library (or a memory leak in CUBLAS?)

I am writing a library that uses CUBLAS to fit statistical models. Currently, the routine that does the CUBLAS work calls cublasCreate() and cublasDestroy() itself. In some cases, this routine needs to be called repeatedly with varying inputs, for example in a simulation study.

What I’ve noticed is that even when I simply call the cublasCreate()/cublasDestroy() pair, a small amount (<1 MB) of GPU memory is never freed. In fact, unless I call cudaDeviceReset(), roughly 40 MB is allocated and never freed on each call to the cublasCreate()/cublasDestroy() pair.

I assume this probably isn’t a memory leak, but rather that I am not using CUBLAS as intended. With that said, does it make sense to call cublasCreate()/cublasDestroy() as I am doing, or should I be doing something different with my single execution thread?
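For reference, here is a minimal sketch of the pattern the CUBLAS API is designed around: create the handle once, pass it into the routine that is called repeatedly, and destroy it only at the end, so the library’s internal allocations happen once rather than per call. The fit_model() routine and its arguments are hypothetical placeholders, not part of my actual library.

```cuda
#include <stdio.h>
#include <cublas_v2.h>

/* Hypothetical per-iteration work: the handle is passed in,
 * not created/destroyed inside the routine. */
static void fit_model(cublasHandle_t handle, const double *d_x, int n)
{
    double result = 0.0;
    /* e.g., a vector norm as a stand-in for the real model fit;
     * d_x is assumed to be device memory */
    cublasDnrm2(handle, n, d_x, 1, &result);
    printf("norm = %f\n", result);
}

int main(void)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    /* Repeated calls (e.g., a simulation study) reuse one handle. */
    for (int rep = 0; rep < 1000; ++rep) {
        /* fit_model(handle, d_x, n); */
    }

    cublasDestroy(handle);  /* destroy once, at the end */
    return 0;
}
```

With this structure, cudaDeviceReset() is also only needed (if at all) once at program exit rather than between iterations.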

Thanks for any help!

Here is some valgrind output (note that nothing is in use at exit when I do not use CUBLAS):

==11437== HEAP SUMMARY:
==11437==     in use at exit: 44,798 bytes in 51 blocks
==11437==   total heap usage: 98,961 allocs, 98,910 frees, 60,601,533 bytes allocated

==11437== 16 bytes in 1 blocks are definitely lost in loss record 1 of 45
==11437==    at 0x4A06FC7: operator new(unsigned long) (vg_replace_malloc.c:261)
==11437==    by 0x9212072: ??? (in /usr/local/cuda-5.0/lib64/
==11437==    by 0x923E478: ??? (in /usr/local/cuda-5.0/lib64/
==11437==    by 0x91FDD6F: ??? (in /usr/local/cuda-5.0/lib64/
==11437==    by 0x3A7D00F77D: _dl_fini (in /lib64/
==11437==    by 0x3A7D439930: __run_exit_handlers (in /lib64/
==11437==    by 0x3A7D4399B4: exit (in /lib64/
==11437==    by 0x3A7D4216A3: (below main) (in /lib64/


==11437== LEAK SUMMARY:
==11437==    definitely lost: 16 bytes in 1 blocks
==11437==    indirectly lost: 0 bytes in 0 blocks
==11437==      possibly lost: 1,496 bytes in 11 blocks
==11437==    still reachable: 43,286 bytes in 39 blocks
==11437==         suppressed: 0 bytes in 0 blocks


What CUDA library version are you using?

I am asking because I noticed a similar issue with CUDA 4.2 whenever a cudaGetDevice() call is done before the first cudaSetDevice() call. I also got 16 bytes reported as ‘definitely lost’ by valgrind (v. 3.8.1). The problem seems to be solved with CUDA 5.5, but some bytes remain ‘possibly lost’. If a cudaSetDevice() is done before the first cudaGetDevice(), no leak is reported. I suspect that the 16 bytes that are reported by valgrind in your example come from that problem, not from cublasCreate(). However, cublasCreate() seems to generate a number of ‘possibly lost’ bytes (I tested with both CUDA 4.2 and CUDA 5.5).
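If that is indeed the cause, the workaround in my tests was simply to make cudaSetDevice() the first CUDA call in the program. A minimal sketch (device index 0 is just an example):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Issuing cudaSetDevice() before any other CUDA call avoided the
     * 16-byte 'definitely lost' report from valgrind in my tests. */
    cudaSetDevice(0);

    int dev = -1;
    cudaGetDevice(&dev);   /* safe now: set came before the first get */
    printf("active device: %d\n", dev);

    cudaDeviceReset();     /* release the context before exit */
    return 0;
}
```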

More generally, with CUDA 5.5, it appears that the first CUDA call (e.g., cudaGetDeviceCount(), cudaSetDevice(), cudaDeviceReset()) always generates a number of ‘possibly lost’ bytes. I am not sure whether this is a valgrind issue or some initialization bug in the CUDA library.