First CUDA function call very slow (more than a minute) on GTX 680 only

The first CUDA function I call (after cudaSetDevice) takes about 65 seconds to run on a GTX 680, but only 400 ms on a GTX 580. I’ve found this to be true on several computing systems, with different GPUs.

I read at http://stackoverflow.com/questions/15166799/any-particular-function-to-initialize-gpu-other-than-the-first-cudamalloc-call that calling cudaFree(0) forces CUDA to do its initialization. However, that call is just as slow: cudaFree(0) takes more than a minute to run on the GTX 680, and the process sits at nearly 100% CPU usage for that whole minute.
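For anyone who wants to reproduce the measurement, here is a minimal sketch of timing that first call. The helper name time_first_call_ms and the command-line device argument are made up for illustration. The timer also covers cudaSetDevice, since which call actually pays for initialization can depend on the toolkit version. Note that a tiny test program like this contains no kernels, so to see the delay described above the same timing would have to be placed at the start of the real application.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock milliseconds spent selecting `device`
// and making the first runtime call on it. Returns a negative value on error.
static double time_first_call_ms(int device) {
    auto start = std::chrono::steady_clock::now();

    cudaError_t err = cudaSetDevice(device);
    if (err == cudaSuccess) {
        err = cudaFree(0);  // no-op free that forces runtime initialization
    }

    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error on device %d: %s\n",
                     device, cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main(int argc, char** argv) {
    // Device index from the command line, defaulting to device 0.
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;

    double ms = time_first_call_ms(device);
    if (ms < 0.0) {
        return 1;
    }
    std::printf("device %d: first CUDA call took %.1f ms\n", device, ms);
    return 0;
}
```

Running it once per device index gives the per-GPU comparison without changing anything else on the system.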

Is this slow initialization a known problem with the GTX 680? Is there a way to speed up the CUDA initialization? I’ve tried this with both CUDA 5.0 and 5.5.

Do the 65 seconds vs. 400 ms figures come from a controlled experiment? That is, did you use the same system with the same software, with the only change being the replacement of the GTX 580 with the GTX 680?

Is the CUDA software in question compiled as a fat binary containing SASS code for both sm_20 (GTX 580) and sm_30 (GTX 680)? For example, if the executable contains only SASS (i.e. machine code) for sm_20 plus PTX, there could be significant overhead for JIT-compiling PTX to sm_30 SASS on the GTX 680. The 100% CPU utilization you observe would be consistent with that hypothesis.
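To check which architectures a binary should carry SASS for, one option is to query the compute capability of every GPU in the system. The sketch below does that and prints a matching nvcc -gencode flag for each device. The helper gencode_flag is a made-up name for illustration, and the flag it builds only requests native machine code for that one architecture.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: build the nvcc flag that embeds native SASS for
// compute capability major.minor, e.g. (3, 0) -> sm_30.
static std::string gencode_flag(int major, int minor) {
    std::string cc = std::to_string(major) + std::to_string(minor);
    return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            continue;  // skip devices that cannot be queried
        }
        // prop.major / prop.minor hold the device's compute capability.
        std::printf("device %d: %s (sm_%d%d) -> %s\n",
                    d, prop.name, prop.major, prop.minor,
                    gencode_flag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

If a device's architecture is missing from the flags the application was built with, the driver has to JIT-compile the embedded PTX for that device, which is the overhead hypothesized above.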

njuffa,

I’m compiling with -arch=sm_20. Tomorrow I’ll try compiling for both sm_20 and sm_30, and I’ll see what happens. I didn’t realize that there would be overhead converting between the two. I have more than a hundred kernels compiled into this binary.

This system has both a GTX 580 and a GTX 680; I just changed cudaSetDevice() between the two tests. It had the same behavior when I physically swapped GPU positions (though I didn’t do a controlled experiment).

Thanks for your help!

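Are you running on a headless Linux system? If so, try running “nvidia-smi -pm 1” at the command prompt as root. This enables persistence mode, which keeps the kernel GPU driver loaded and the GPU(s) initialized even when no application is using them, so the first CUDA call does not have to wait for the driver to prepare the GPU(s).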

Compiling with “-gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30” resolved the problem: no more long initialization for the GTX 680. Thanks, njuffa. This has been bugging me for a long time.

Arakageeta, this is Linux but currently not headless. I’ll keep that command in mind, though.