I have a program linked against cuda-8.0.26 which I’m trying to run on a Tesla V100 in a CentOS 6.5 host. When the JIT-compiled binary has already been cached, the running program uses about 1 GB of host RAM, which is what I expected. However, the first time the program runs (or when I test with CUDA_FORCE_PTX_JIT=1) it uses far more host memory than that. The exact amount varies with the driver version; on 396.26 it is about 26 GB. Furthermore, this memory is not released when the JIT compilation finishes and the program starts doing its actual work. It is only released when the program terminates.
The program contains a large number of kernels, and many of them are templated, so it makes sense to me that JIT-compiling it takes a lot of RAM. However, 26 GB is difficult to accommodate, especially since it isn’t freed until the program exits hours later. Is there a way to reduce the memory the JIT uses, or to have it released once compilation is done?
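
In case it helps anyone reproduce the comparison, here is a minimal sketch of one way to compare peak host memory between a normal (cached) run and a forced-JIT run. It is only an illustration, not necessarily how the numbers above were obtained. The helper names `peak_rss_kib` and `compare_cached_vs_jit` are hypothetical; the sketch assumes Linux (where `ru_maxrss` is reported in KiB) and Python 3.8 or newer, and the only CUDA-specific piece is the CUDA_FORCE_PTX_JIT=1 environment variable mentioned above.

```python
import os
import sys


def peak_rss_kib(cmd, extra_env=None):
    """Run cmd (a list of strings) to completion and return the child's peak RSS.

    On Linux, ru_maxrss is reported in KiB.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    # Spawn the child directly so that os.wait4 can reap it and hand back
    # the resource usage of this specific process.
    pid = os.posix_spawnp(cmd[0], cmd, env)
    _, _status, usage = os.wait4(pid, 0)
    return usage.ru_maxrss


def compare_cached_vs_jit(cmd):
    """Run cmd twice: once normally, once with PTX JIT compilation forced."""
    return {
        # Assumes the compute cache is already warm for this run.
        "cached": peak_rss_kib(cmd),
        # CUDA_FORCE_PTX_JIT=1 makes the driver JIT-compile from PTX again.
        "forced_jit": peak_rss_kib(cmd, {"CUDA_FORCE_PTX_JIT": "1"}),
    }
```

Calling `compare_cached_vs_jit(["./my_app"])` (where `./my_app` stands in for the real binary) returns both peaks, so different driver versions can be compared with the same script. Note that this reports only the peak, so it cannot show whether the memory is released after the JIT finishes; that would need sampling the process while it runs.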