Avoiding JIT compilation on a system with 2 different GPUs

Hello all,

I would like to avoid JIT compiling for a list of expected GPUs, and instead fall back on JIT compiling if an unexpected GPU is used. Specifically, I am trying to compile some CUDA kernels for sm_35 (Tesla K40c) and sm_52 (Tesla M40) using CUDA 7.5. I only need features from sm_35. I have tried the following (the region of interest is bolded):

nvcc -std=c++11 <b>-arch=sm_35</b> -Xptxas="-v" test.cu -o test
nvcc -std=c++11 <b>-arch=compute_35 -code=sm_35,sm_52</b> -Xptxas="-v" test.cu -o test
nvcc -std=c++11 <b>-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_52</b> -Xptxas="-v" test.cu -o test

where ‘test.cu’ contains a very simple kernel and a main that includes the following code:

Timer JTimer;      /* user-defined wall-clock timer */
printf("JIT compiling..\n");
cudaSetDevice(0);  /* first CUDA call - the startup cost is measured here */
printf("JIT finished - took %f seconds\n", JTimer.elapsed());
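
For completeness, here is a self-contained sketch of an equivalent test using std::chrono instead of the Timer helper (the error check and the exact wording of the messages are just for illustration, not part of the original program):

// timing_sketch.cu - illustrative sketch; build with the nvcc commands above
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    printf("Initializing CUDA..\n");
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaSetDevice(0);   /* first CUDA call - startup cost incurred here */
    auto t1 = std::chrono::steady_clock::now();
    printf("Initialization finished - took %f seconds (status: %s)\n",
           std::chrono::duration<double>(t1 - t0).count(),
           cudaGetErrorString(err));
    return 0;
}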

The problem is that when both GPUs are in the same Linux workstation, I cannot seem to avoid a large CUDA startup cost.

I am recording a delay of ~6-7 seconds during startup/CUDA context creation (double-checked with a physical stopwatch) when using either device and any of the above commands. I have read the CUDA GPU Compilation docs and can’t quite figure out what is wrong with the 2nd or 3rd commands. Any ideas/advice on how to shorten the startup time with two different (but known) devices on the same machine?

Thanks in advance!

How did you establish that the time you are measuring is due to JIT compilation and not other aspects of CUDA context creation? Are you on a Linux platform? Have you turned on persistence mode to prevent the unloading of the CUDA driver when not in use? Does the system have a large amount of system memory?

JIT compilation is avoided by building a fat binary that contains SASS binaries for all supported GPU architectures. Use cuobjdump --dump-sass to double-check that they are there.
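
For example, each embedded SASS image is preceded by an "arch = sm_xx" header line in the dump (the exact formatting may differ between CUDA versions), so something along these lines should list both sm_35 and sm_52:

cuobjdump --dump-sass test | grep "arch ="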

Unless I overlook something, the following should give you the fat binary you desire:

-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_52

Thanks njuffa, you are correct. All the binaries are there. If this is the case, how would I go about exploring other aspects of CUDA context creation in an attempt to shorten the startup time? Is this an acceptable startup time?

Thanks again!

I already asked several questions relevant to that in post #2 above, and it would be helpful if you could provide answers to them.

Sorry, not sure how I missed those!

How did you establish that the time you are measuring is due to JIT compilation and not other aspects of CUDA context creation?

I didn’t really. My research showed that it was the likely cause, but the SASS dump proves otherwise.

Are you on a Linux platform?

Yes, RHEL7.

Have you turned on persistence mode to prevent the unloading of the CUDA driver when not in use?

The persistence daemon looks promising. Unfortunately, I don’t have root access on the remote machine, and so I will need to get in touch with the system admin to test this.

Does the system have a large amount of system memory?

32 GB of physical memory.

I now see in the docs that 1-3 seconds per device is normal due to ECC scrubbing. I will pursue the persistence daemon and post back if it works. Thanks again!

Definitely give the persistence daemon a try (I didn’t realize until now that the classical persistence mode had been EOLed).
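
If the driver installation provides a systemd unit for it (common with the RPM packages on RHEL7), enabling it would look something like this (just a sketch; if there is no unit file, the admin can also start the nvidia-persistenced binary directly):

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced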

32 GB is not what I would call large system memory. To implement a unified address space, the CUDA driver needs to map both CPU and GPU memory into a single address map at startup; this takes longer the more CPU and GPU memory there is to map. Some of the necessary work relies on OS calls that are basically single-threaded, so use of a CPU with high single-thread performance (my recommendation: CPU base clock > 3.5 GHz) helps somewhat.

The persistence daemon decreased startup time from ~5 seconds to ~10 milliseconds, which should be small enough for my needs. This was exactly what I was looking for, thanks for your time!