I’m working on a server with a Tesla M40 24GB and a Tesla K40c, and anything compiled with NVCC takes about 6 seconds to run, even a completely trivial program. For example:
int main(void) {
    cudaDeviceSynchronize();
    return 0;
}
takes 6.5 seconds to complete. Running nvprof, I see that only about 200 ms of this is accounted for by the CUDA runtime calls it reports. Here is a screenshot of the nvprof output: https://imgur.com/a/sYvxHg5.

Any ideas? I’ve seen some reports (https://devtalk.nvidia.com/default/topic/985612/64-bit-windows-10-gtx-1060-cuda-kernel-startup-time-/, for example) of large startup times on Windows, and someone mentioned high startup times on clusters with multiple Tesla GPUs, but this seems extreme. I’m not sure how to reduce this startup time. Thanks!
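In case it helps with diagnosis, here is a rough sketch of how the startup cost could be separated from the synchronize call itself. It is not something I have profiled in detail. It assumes that `cudaFree(0)` is enough to force CUDA context creation (a common idiom), and the helper `ms_since` is a hypothetical name of my own.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock milliseconds elapsed since `start`.
static double ms_since(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main(void) {
    // First runtime call: assumed to trigger driver and context initialization.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    double init_ms = ms_since(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Second runtime call: the context already exists at this point.
    auto t1 = std::chrono::steady_clock::now();
    err = cudaDeviceSynchronize();
    double sync_ms = ms_since(t1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first CUDA call (initialization): %.1f ms\n", init_ms);
    printf("cudaDeviceSynchronize afterwards: %.1f ms\n", sync_ms);
    return 0;
}
```

Built with something like `nvcc startup_timing.cu -o startup_timing` (the file name is arbitrary), this should show whether the delay sits in the first CUDA call. Running it with `CUDA_VISIBLE_DEVICES=0` and then `CUDA_VISIBLE_DEVICES=1` would also indicate whether the cost depends on which GPU is visible.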