Bug: CUDA's internal kernel/CUmodule cache has no maximum size, causing memory leaks

I’m running CUDA 10 with driver version 417.22 and have encountered a strange behaviour of cuModuleLoadDataEx and cuModuleUnload: CUDA seemingly keeps the modules created by cuModuleLoadDataEx in memory as a cache, even after cuModuleUnload is called. This behaviour is not affected by disabling or enabling the kernel cache on the HDD. Consequently, compiling the same PTX string a second time is very fast, since CUDA only needs to do a lookup. Unfortunately, this cache seemingly has no maximum size: compiling the same PTX string a second time is always fast, no matter how many kernels have been compiled in between.

Making things worse, compiling a new PTX string to a kernel always increases the memory usage of my program by several megabytes, and this memory is seemingly not freed by calling cuModuleUnload. As a consequence, my program quickly fills all memory (8 GB) while compiling thousands of kernels in order to find the fastest ones.
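In case it helps, here is a minimal sketch of the kind of standalone reproducer I have in mind (the PTX body, kernel naming, and iteration count are illustrative assumptions, not my actual search code): each iteration loads a unique PTX string and immediately unloads the module, yet the process's host memory keeps growing.

```
// Sketch of a reproducer for the module-cache growth.
// Build e.g. with: g++ repro.cpp -o repro -I<cuda>/include -lcuda
#include <cuda.h>
#include <cstdio>
#include <string>

#define CHECK(call)                                              \
    do {                                                         \
        CUresult err = (call);                                   \
        if (err != CUDA_SUCCESS) {                               \
            fprintf(stderr, "CUDA error %d at line %d\n",        \
                    (int)err, __LINE__);                         \
            return 1;                                            \
        }                                                        \
    } while (0)

int main()
{
    CUdevice  dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    for (int i = 0; i < 10000; ++i) {
        // Each iteration uses a unique kernel name, so the PTX string differs
        // and cannot be served from a previously compiled module.
        std::string ptx =
            ".version 6.0\n"
            ".target sm_30\n"
            ".address_size 64\n"
            ".visible .entry kernel_" + std::to_string(i) + "()\n"
            "{\n"
            "    ret;\n"
            "}\n";

        CUmodule mod;
        CHECK(cuModuleLoadDataEx(&mod, ptx.c_str(), 0, nullptr, nullptr));
        CHECK(cuModuleUnload(mod));  // expected to release the module's memory

        if (i % 1000 == 0)
            printf("iteration %d\n", i);  // watch the process RSS here, e.g. with top
    }

    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```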

I suggest you file a bug using the information in the sticky post at the top of this sub-forum.

You will likely be asked for complete code that reproduces your observation.

Ok, thanks :)