cudaSetDevice() time, so weird! cudaSetDevice() takes a long time.

Hi, guys,

Recently I developed a multi-GPU program on my Tesla S1070. I call cudaSetDevice() in each host thread and time it with the CUDA timer, and something weird happens: cudaSetDevice() takes a very long time, about 1700 ms. The code uses some CUDPP functions, but in some other multi-GPU programs that don't use CUDPP it only takes 40-70 ms. The CUDA toolkit version is 3.1.

On the GTX 480, cudaSetDevice() takes 170 ms with CUDA toolkit 3.0; when I upgrade to 3.1, it takes 280 ms.

Can someone tell me why this happens?
Thanks in advance.

cudaSetDevice() isn’t really taking 1700 ms. It’s the context setup overhead that you get on the first CUDA call of almost any type. Yes, it really can take a second or even two. After a context is set up, cudaSetDevice() (and other CUDA calls) are reasonably fast.
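
Here’s a quick, untested sketch of one way to see this, using plain gettimeofday() instead of the cutil timer; cudaFree(0) is just a convenient call that forces context creation:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <sys/time.h>

    static double ms_now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    int main(void)
    {
        double t0 = ms_now();
        cudaSetDevice(0);   /* first CUDA call in the process */
        double t1 = ms_now();
        cudaSetDevice(0);   /* the same call again, once everything is loaded */
        double t2 = ms_now();
        cudaFree(0);        /* forces context creation on device 0 */
        double t3 = ms_now();

        printf("first cudaSetDevice : %.1f ms\n", t1 - t0);
        printf("second cudaSetDevice: %.1f ms\n", t2 - t1);
        printf("cudaFree(0)         : %.1f ms\n", t3 - t2);
        return 0;
    }

If nearly all of the time shows up only on the first call, the cost is one-time startup overhead rather than anything cudaSetDevice() itself does.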

cudaSetDevice doesn’t set a context.

Try running nvidia-smi in a loop in the background before you run your app. (one of these days I’ll improve it, I just have to remember it when I have a modicum of free time…)

Thanks for your reply.

I time it like this:

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);
    cudaSetDevice(0);
    cutStopTimer(timer);
    printf("cudaSetDevice: %f ms\n", cutGetTimerValue(timer));

You mean that on the first CUDA call (my cudaSetDevice()), the host thread must set up a context, and most of my 1700 ms was spent on this context setup, not really on cudaSetDevice() itself?

OK, Tim, what should I do with nvidia-smi?

When I use nvidia-smi as follows (my GPUs are a Tesla S1070 plus a gtx8400 for display):

    nvidia-smi -g 0 -c 1 &
    nvidia-smi -g 1 -c 1 &
    nvidia-smi -g 2 -c 1 &
    nvidia-smi -g 3 -c 1 &
    nvidia-smi -g 4 -c 1 &

I get a CUDA runtime API error at the first cudaMalloc(): “all CUDA-capable devices are busy or unavailable.”

No, don’t do anything with exclusive mode or anything like that, just do nvidia-smi -l.

It isn’t anything to do with contexts (like Tim said, cudaSetDevice() doesn’t establish a context), but rather with the driver itself. The NVIDIA driver seems to like unloading internal modules and freeing resources “automagically” after a period of inactivity, and I think most of that time you are measuring is the driver re-loading and re-initialising everything. Running nvidia-smi in daemon mode keeps an API client attached and prevents the driver from unloading everything.
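
If you want to see the same effect without nvidia-smi, a rough keep-alive sketch (untested, just to illustrate the idea of keeping a client attached) would be something like this:

    /* Keep-alive sketch: hold a CUDA context open so the driver stays
       loaded between runs of the real application. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (cudaFree(0) != cudaSuccess) {   /* first call pays the init cost once */
            fprintf(stderr, "CUDA init failed\n");
            return 1;
        }
        printf("Holding a CUDA context open; Ctrl-C to quit.\n");
        for (;;)
            sleep(60);                      /* keep the process (and context) alive */
    }

Tim’s nvidia-smi -l is the simpler route, of course; the point is only that keeping any client attached stops the driver from tearing everything down.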

Interesting, I always thought the first-CUDA-call delay was a per-process setup overhead, not something system-wide.

Thanks as always Tim for teaching us the little details!

Thanks very much, Tim.

So now I know clearly where the 1700 ms comes from. Thanks, avidday.

Might be able to sneak an improvement into 3.2. We’ll see!