I would like to avoid JIT compilation for a list of expected GPUs, and instead fall back on JIT compilation only if an unexpected GPU is used. Specifically, I am trying to compile some CUDA kernels for sm_35 (Tesla K40c) and sm_52 (Tesla M40) using CUDA 7.5. I only need features from sm_35. I have tried the following (the region of interest is bolded):
nvcc -std=c++11 <b>-arch=sm_35</b> -Xptxas="-v" test.cu -o test
nvcc -std=c++11 <b>-arch=compute_35 -code=sm_35,sm_52</b> -Xptxas="-v" test.cu -o test
nvcc -std=c++11 <b>-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_52</b> -Xptxas="-v" test.cu -o test
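Based on my reading of the docs, I would also expect the following variant to embed sm_35-derived SASS for both cards plus compute_35 PTX as the JIT fallback for unexpected GPUs (assuming, as above, that I only need sm_35 features), but I'm not sure it's the right approach:

nvcc -std=c++11 <b>-gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=sm_52 -gencode arch=compute_35,code=compute_35</b> -Xptxas="-v" test.cu -o test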
where ‘test.cu’ contains a very simple kernel and a main that includes the following code:
Timer JTimer;
printf("JIT compiling..\n");
cudaSetDevice(0); /* first cuda function */
printf("JIT finished - took %f seconds\n", JTimer.elapsed());
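(In case it matters, Timer is just a thin wall-clock wrapper around std::chrono, roughly this sketch:)

#include <chrono>

// Minimal wall-clock timer: elapsed() returns seconds since construction
struct Timer {
    std::chrono::high_resolution_clock::time_point start =
        std::chrono::high_resolution_clock::now();
    double elapsed() const {
        return std::chrono::duration<double>(
            std::chrono::high_resolution_clock::now() - start).count();
    }
};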
The problem is that when both GPUs are in the same Linux workstation, I cannot seem to avoid a large CUDA startup cost.
I am recording a delay of ~6-7 seconds during startup/CUDA context creation (double-checked with a physical stopwatch) when using either device and any of the above commands. I have read the CUDA GPU Compilation Docs and can't quite figure out what is wrong with the second or third command. Any ideas/advice on how to shorten the startup time with two different (but known) devices on the same machine?
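In case it helps, here is a minimal sketch along the lines of what I use to confirm that both cards and their compute capabilities are visible:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: enumerate visible devices and print their compute capabilities
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s (sm_%d%d)\n", i, p.name, p.major, p.minor);
    }
    return 0;
}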
Thanks in advance!