Measure the amount of overhead / gpu memory that will be used by the CUBLAS initialization

Is there some means to measure the amount of overhead / GPU memory that will be used by CUBLAS initialization? As part of my application-level load balancing, I need to provide a GPU-memory estimate that includes the memory requirements of my program plus the overheads of CUBLAS and cuDNN. I tried subtracting the free memory before and after CUBLAS initialization, but that yields a large value on the first call in the process and a much smaller value on subsequent calls. Is there a way to get an accurate estimate of the GPU memory overhead (library overhead + temporary overhead) of CUBLAS initialization?

CUBLAS initialization is usually triggered by the first call to create a handle; subsequent calls do not re-trigger it. There is a significant one-time memory cost, in both host and device memory, to initialize CUBLAS. So what you observe is expected, I think.

You can certainly measure it, using methods like the one you describe, but I don't know of a way to estimate it programmatically.

Note that the device memory “cost” for CUBLAS might consist largely of CUDA overhead, not anything specific to CUBLAS. So if your first call initializes CUDA, and then later you create a CUBLAS handle, you may notice lower “overhead” associated with CUBLAS in that case.
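One way to see the CUDA-context cost separately from the CUBLAS cost is to query memory from outside the CUDA runtime, via NVML, before the process makes any CUDA call. This is only a sketch under assumptions: it uses NVML (link with `-lnvidia-ml`), assumes device 0, and the numbers are a whole-device view, so other processes sharing the GPU can skew them.

```cpp
// Sketch: measure the device memory taken by CUDA context creation itself.
// cudaMemGetInfo() cannot observe this from inside the same process, because
// calling it would itself create the context; NVML queries the device directly.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // assumes device 0

    nvmlMemory_t before, after;
    nvmlDeviceGetMemoryInfo(dev, &before); // before any CUDA call in this process

    cudaFree(0);                           // forces CUDA context creation

    nvmlDeviceGetMemoryInfo(dev, &after);
    printf("CUDA context overhead: %llu MB\n",
           (unsigned long long)((before.free - after.free) >> 20));

    nvmlShutdown();
    return 0;
}
```

Whatever cudaFree(0) costs here would otherwise be folded into the "CUBLAS overhead" measured on the first cublasCreate() call.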

The measureCublasOverhead() function that I have written takes a GPU-memory snapshot before and after the cublasCreate() call, and then calls cublasDestroy() immediately. The difference between the two snapshots is returned by measureCublasOverhead(). Also, CUDA initialization is done well before the call to measureCublasOverhead().
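For reference, a minimal sketch of what such a measureCublasOverhead() might look like (my reconstruction, not the poster's actual code; error checking omitted; cudaFree(0) is an idiomatic way to force CUDA context creation up front so its cost is not attributed to CUBLAS):

```cpp
// Sketch: measure device memory consumed between cublasCreate() and cublasDestroy().
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

size_t measureCublasOverhead() {
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaMemGetInfo(&freeAfter, &total);
    cublasDestroy(handle);

    return freeBefore - freeAfter;  // bytes held while the handle existed
}

int main() {
    cudaFree(0);  // initialize CUDA well before measuring, as described above

    // First call typically reports the one-time library init plus the
    // per-handle cost; later calls report only the per-handle cost.
    printf("first  call: %zu MB\n", measureCublasOverhead() >> 20);
    printf("second call: %zu MB\n", measureCublasOverhead() >> 20);
    return 0;
}
```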

On the Quadro P1000 GPU that I tested, the first call to measureCublasOverhead() returns 70MB, but subsequent calls return 10MB. As you rightly mentioned in another topic (leak in cublasCreate + cublasDestroy?), part of the CUBLAS initialization overhead is not released by cublasDestroy(), which is why subsequent calls to measureCublasOverhead() return a smaller value. But is there a way to make the measurement report 70MB on every call?

I doubt it, and I don't understand why you would want to do that. The point is that the CUBLAS library has some one-time overhead associated with library initialization, plus some per-handle overhead that appears each time you create a handle.

The one-time overhead should appear once in your program, not on every create/destroy cycle.

Hi Robert,

Is there a way to release the GPU memory that is consumed as part of the one-time overhead of cublasCreate() / cudnnCreate() referred to here?

cublasDestroy() (or similar for cuDNN) is the only method I am aware of.

So I think the correct answer is that I am not aware of any way to release the one-time CUBLAS overhead, other than program termination.