Recently I developed a multi-GPU program on my Tesla S1070. I call cudaSetDevice() from each host thread and time it with the CUDA timer, and something weird happens: cudaSetDevice() takes a very long time, about 1700 ms. The code uses some CUDPP functions; in another multi-GPU program without CUDPP it takes only 40-70 ms. The CUDA toolkit version is 3.1.
On a GTX 480, cudaSetDevice() takes 170 ms with CUDA toolkit 3.0; after upgrading to 3.1, it takes 280 ms.
Can someone tell me why this happens?
Thanks in advance.
cudaSetDevice() isn’t really taking 1700 ms. What you’re measuring is the context-setup overhead that you pay on the first CUDA call of almost any type. Yes, it really can take a second or even two. Once a context is set up, cudaSetDevice() (and other CUDA calls) are reasonably fast.
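To see this for yourself, you can force the one-time setup with a throwaway call before you start timing. A minimal sketch (my own illustration, not from the thread; it assumes device 0 and uses `cudaFree(0)` as the conventional do-nothing call that triggers context creation in the runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaSetDevice(0);  // cheap by itself: just selects a device
    cudaFree(0);       // throwaway call: pays the one-time setup cost here

    // Time cudaSetDevice() again, now that setup is done.
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaSetDevice(0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaSetDevice() after warm-up: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If you move the `cudaFree(0)` inside the timed region instead, you should see most of your 1700 ms reappear there.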
Try running nvidia-smi in a loop in the background before you run your app. (One of these days I’ll improve it; I just have to remember it when I have a modicum of free time…)
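For reference, the background loop could be as simple as the sketch below (my own example; the 10-second interval is an arbitrary choice, and any periodic nvidia-smi query should do):

```shell
#!/bin/sh
# Keep an NVIDIA API client attached so the driver doesn't tear down
# its state between runs. Launch in the background before your app:
#   ./keep_driver_warm.sh &
while true; do
    nvidia-smi > /dev/null 2>&1   # any query keeps the driver loaded
    sleep 10
done
```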
You mean that on the first CUDA call (cudaSetDevice() in my case), the host thread must set up a context, and most of my 1700 ms was spent on that context setup rather than on cudaSetDevice() itself?
It isn’t really anything to do with contexts (as Tim said, cudaSetDevice() doesn’t establish a context), but rather with the driver itself. The NVIDIA driver seems to unload internal modules and free resources “automagically” after a period of inactivity, and I think most of the time you are measuring is the driver re-loading and re-initialising everything. Running nvidia-smi in a background loop keeps an API client attached and prevents the driver from unloading everything.