I recently realised that initialising the cusolver library with cusolverDnCreate reserves a rather large amount of GPU memory (around 450 MB). That by itself is not a problem, but this memory does not appear to be fully freed by cusolverDnDestroy: querying the available GPU memory before and after creating and destroying the cusolver instance shows that around 390 MB of the initially free GPU memory is still allocated. This happens even if the function using cusolver is compiled separately and linked to the main function, and running cuda-memcheck does not report any problems either. It therefore seems that any program using even a single piece of cusolver functionality will have 390 MB less GPU memory available for other GPU tasks for the whole runtime of the program, starting from the point where the handle is created.
Here’s a code sample:
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <stdio.h>
#include "cudaCheck.hpp"

void CusolverHandleTest()
{
    cusolverDnHandle_t handle;
    cusolverStatus_t status;

    status = cusolverDnCreate(&handle); // create handle
    if (status != CUSOLVER_STATUS_SUCCESS) {
        printf("cusolverDnCreate failed\n");
    }

    // ...use cusolver however needed

    status = cusolverDnDestroy(handle); // destroy handle
    if (status != CUSOLVER_STATUS_SUCCESS) {
        printf("cusolverDnDestroy failed\n");
    }
}

int main(void)
{
    size_t mf0, mf1, ma0, ma1;

    CHECK(cudaMemGetInfo(&mf0, &ma0));
    CusolverHandleTest(); // create and destroy the cusolver instance in a different scope
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemGetInfo(&mf1, &ma1));

    printf("How much more GPU memory is now allocated than at the start: %zu MB\n", (mf0 - mf1) / (1024 * 1024));
    return 0;
}
where CHECK() is a basic CUDA error-checking macro (sketched below).
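For completeness, CHECK() follows the usual pattern of printing the error string and aborting; this is only a sketch, and the exact macro in cudaCheck.hpp may differ:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call does not return cudaSuccess.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)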
Is there something wrong with how I use cusolver or cudaMemGetInfo, or is this a bug?
I’m using CUDA V11.2.142 on Ubuntu 20.04.5 LTS.
When I run your code as posted on CUDA 11.6 or CUDA 11.8, I get a report of 81MB, not 390MB.
CUDA uses lazy initialization, so CUDA may not be fully initialized at the first call to cudaMemGetInfo. By the time you make the second call, some CUDA overhead will have been allocated.
However, I’m not aware of any claim that destroying a handle will release all library overhead, so I’m pretty confident this is not a bug.
For example, when CUDA loads a library like cusolver, it loads all the kernels in the cusolver library. Destroying a handle doesn’t unload all these kernels.
If you’d like to see a change in CUDA behavior, you can always file a bug. You may also want to investigate the opt-in (for CUDA 11.7 and 11.8) “lazy” module loading, which will likely reduce the memory footprint.
Build with CUDA 11.7 or 11.8 and run with the following environment variable set:
CUDA_MODULE_LOADING=LAZY
However, as I reported, when I test the code you posted here I get 81MB, not 390MB, and this switch has no effect on that observation.
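If you want a quick sanity check that the setting actually reaches your process, you can query the environment at runtime; this minimal sketch only shows what the variable is set to, not whether the driver honours it:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Report the module-loading mode requested via the environment, if any.
    const char *mode = getenv("CUDA_MODULE_LOADING");
    printf("CUDA_MODULE_LOADING = %s\n", mode ? mode : "(not set)");
    return 0;
}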
Thanks for the clarifications! I don’t have the possibility to install CUDA 11.7 or 11.8 at the moment, but I’ll keep lazy module loading in mind to try later. The 81MB you got seems much more reasonable than 390MB, so let’s hope the difference is just a matter of the newer CUDA version.
I did a few further tests, and it seems that standard CUDA calls don’t have a similar effect on memory usage: calling custom kernels and doing memory transfers instead of creating a cusolver instance results in a memory “loss” of only 1-2 MB. So the large allocation seems to be mainly due to loading the cusolver library, and not due to the standard CUDA machinery only being initialized after the first cudaMemGetInfo call.
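For reference, the comparison test was along these lines (just a sketch; dummyKernel and PlainCudaTest are placeholder names, and PlainCudaTest is called from the same main() above instead of CusolverHandleTest):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include "cudaCheck.hpp"

__global__ void dummyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// A device allocation, copies and a custom kernel, but no cusolver.
void PlainCudaTest()
{
    const int n = 1 << 20;
    float *h_x = (float*)malloc(n * sizeof(float));
    float *d_x = NULL;
    for (int i = 0; i < n; ++i) h_x[i] = 0.0f;

    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice));
    dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_x));
    free(h_x);
}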
As this seems to be a feature of cusolver, I should then account for the extra memory needed to load the library when checking how much memory certain custom functions need. For context: I would like routines that do certain tasks either on the GPU with cusolver or on the CPU, depending on whether the data fits into GPU memory. The easiest solution I could think of is to always create the cusolver handle before checking the free GPU memory with cudaMemGetInfo, even in cases where the handle is not needed in the current scope but only in some subroutines.
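Roughly what I have in mind, as a sketch only (solve, runOnGpu and runOnCpu are hypothetical placeholders, and the memory estimate is made up):

#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <stdio.h>
#include "cudaCheck.hpp"

// Hypothetical placeholders for the actual GPU and CPU solver paths.
void runOnGpu(cusolverDnHandle_t handle, double *A, int n) { (void)handle; (void)A; (void)n; /* ...cusolver path... */ }
void runOnCpu(double *A, int n)                            { (void)A; (void)n; /* ...CPU fallback... */ }

cusolverDnHandle_t g_handle = NULL;

void solve(double *A, int n)
{
    // Create the handle first, so the library's own footprint is already
    // reflected in the free-memory query below.
    if (g_handle == NULL && cusolverDnCreate(&g_handle) != CUSOLVER_STATUS_SUCCESS) {
        printf("cusolverDnCreate failed\n");
        return;
    }

    size_t freeMem, totalMem;
    CHECK(cudaMemGetInfo(&freeMem, &totalMem));

    size_t needed = (size_t)n * n * sizeof(double) * 2; // rough, problem-dependent estimate
    if (needed < freeMem)
        runOnGpu(g_handle, A, n);
    else
        runOnCpu(A, n);
}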
The real problem I see with this behaviour is that it restricts the memory available to standard custom CUDA kernels/functions that don’t even use cusolver, just because they are used as subroutines in a larger program that does. Is there any way to explicitly unload the whole library once I know it won’t be used again in the same program?
I’m not aware of any way that isn’t extreme. An extreme method would be to make the application multi-process: start a host process that doesn’t do anything with CUDA, spawn a process that does what you need with cusolver, then terminate that spawned process. That will unload the cusolver library (and everything else that process owned). As I’ve mentioned already, you can also file a bug.
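On Linux, a minimal sketch of that spawn-and-terminate pattern could look like this (fork-based, so POSIX-specific; the important part is that the parent doesn’t touch CUDA before the fork, and doCusolverWork here is just a stand-in for the real cusolver work):

#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for whatever cusolver-based work the program actually does.
static void doCusolverWork(void)
{
    cusolverDnHandle_t handle;
    if (cusolverDnCreate(&handle) == CUSOLVER_STATUS_SUCCESS) {
        // ...use cusolver...
        cusolverDnDestroy(handle);
    }
}

int main(void)
{
    // The parent must not have initialized CUDA before forking.
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {
        // Child: gets its own CUDA context, loads cusolver, does the work.
        doCusolverWork();
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    // When the child exits, its CUDA context and the cusolver library
    // footprint are released along with the process.

    // The parent now initializes its own, cusolver-free CUDA context.
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("Free GPU memory in parent: %zu MB\n", freeMem / (1024 * 1024));
    return 0;
}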