cudaSetDevice takes a long time in CUDA

I have a GPU machine (6 Tesla P100 GPUs) running Windows Server 2016 and CUDA 9.1.

Recently, the cudaSetDevice API call has started taking a long time.

When I execute vectorADD.exe, the program spends more than 20 seconds in the cudaSetDevice call, which is frustrating.

Why does this happen? Is it a bug in CUDA 9.1?

It’s probably CUDA initialization overhead. The fact that you have 6 GPUs in the machine doesn’t help. If the machine has a lot of host memory, that can also make things slower. It’s probably not specific to cudaSetDevice(); that just happens to be the first CUDA runtime API call in the program, so it “absorbs” the CUDA initialization overhead.

If you need fewer than all 6 GPUs for a particular program, you may be able to reduce the initialization overhead by using the CUDA_VISIBLE_DEVICES environment variable to limit CUDA runtime initialization activity. Setting persistence mode on the GPU(s) may also help.

[url]https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars[/url]
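One way to check whether limiting the visible GPUs helps is to time the same executable with and without CUDA_VISIBLE_DEVICES set. The following is only a rough sketch in Python: the helper names (restricted_env, timed_run, compare) are made up for illustration, and it measures the wall-clock time of the whole process, not just the first CUDA call, so the difference between the two runs is only an approximation of the initialization savings.

```python
import os
import subprocess
import time


def restricted_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def timed_run(command, env):
    """Run the command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


def compare(command):
    """Time the command with the inherited environment, then with only GPU 0 visible.

    Assumption: CUDA_VISIBLE_DEVICES is not already set in the inherited
    environment, so the first run sees every GPU in the machine.
    """
    baseline = timed_run(command, dict(os.environ))
    single = timed_run(command, restricted_env([0]))
    return baseline, single
```

For the program in the question, calling compare(["vectorADD.exe"]) (assuming the executable is in the current directory or on the PATH) returns the two timings in seconds. If the second number is much smaller, the delay is mostly per-GPU initialization rather than anything specific to cudaSetDevice().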

txbob gave all the immediately applicable practical advice there is.

On a slightly more elevated level: CUDA host-side overhead can be reduced by choosing CPUs with very high single-thread performance (which to first approximation means a high clock rate; I recommend >= 3.5 GHz) and fast, high-throughput system memory (in practice, as many channels of DDR4-2666 as you can afford).

A main contributor to CUDA startup overhead is the mapping of all system memory and all GPU memory into a single virtual address map. The time to do this is largely spent in various OS calls that to my knowledge do not allow for parallelization, therefore single-thread performance is important.