Tesla M40 6 second CUDA startup time

I’m working on a server with a Tesla M40 24GB and a Tesla K40c, and anything compiled with nvcc takes about 6 seconds to run, even a totally trivial program. For example:

#include <cuda_runtime.h>

int main(void) {
  // First CUDA runtime call: triggers runtime and context initialization.
  cudaDeviceSynchronize();
  return 0;
}

This takes 6.5 seconds to complete. Running nvprof, I see that only about 200 ms of this is accounted for by actual CUDA runtime calls. Here is a screenshot of the nvprof output: https://imgur.com/a/sYvxHg5. Any ideas? I’ve seen some reports of large startup times on Windows (https://devtalk.nvidia.com/default/topic/985612/64-bit-windows-10-gtx-1060-cuda-kernel-startup-time-/, for example), and someone mentioned high startup times for clusters with multiple Tesla GPUs, but this seems extreme. I’m not sure how to improve this. Thanks!

Set persistence mode in the driver.
Beyond that, servers with large amounts of installed system memory tend to have longer CUDA runtime initialization times, and there isn’t much you can do about this. If you only need a single device, it may help to use the CUDA_VISIBLE_DEVICES environment variable to limit the runtime’s scope to a single GPU.

You can search online for more information about the topics I have mentioned, such as persistence mode and CUDA_VISIBLE_DEVICES.