cudaSetDevice() time, so weird! cudaSetDevice() takes a long time.

Hi, guys,

Recently I developed a multi-GPU program on my Tesla S1070. I call cudaSetDevice() in each host thread and time it with the CUDA timer, and something weird happens: cudaSetDevice() takes a very long time, about 1700 ms. The code uses some CUDPP functions, but in some other multi-GPU programs that don't use CUDPP it only takes 40-70 ms. The CUDA toolkit version is 3.1.

On the GTX 480, cudaSetDevice() takes 170 ms with CUDA toolkit 3.0; when I upgrade to 3.1, it takes 280 ms.

Can someone tell me why this happens?
Thanks in advance.

cudaSetDevice() isn’t really taking 1700 ms. It’s the context setup overhead that you get on the first CUDA call of almost any type. Yes, it really can take a second or even two. After a context is set up, cudaSetDevice() (and other CUDA calls) are reasonably fast.
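
Here’s a quick, untested sketch of one way to see this, using plain gettimeofday() instead of the cutil timer; cudaFree(0) is just a convenient call that forces context creation:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <sys/time.h>

    static double ms_now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    int main(void)
    {
        double t0 = ms_now();
        cudaSetDevice(0);   /* first CUDA call in the process */
        double t1 = ms_now();
        cudaSetDevice(0);   /* the same call again, once everything is loaded */
        double t2 = ms_now();
        cudaFree(0);        /* forces context creation on device 0 */
        double t3 = ms_now();

        printf("first cudaSetDevice : %.1f ms\n", t1 - t0);
        printf("second cudaSetDevice: %.1f ms\n", t2 - t1);
        printf("cudaFree(0)         : %.1f ms\n", t3 - t2);
        return 0;
    }

If nearly all of the time shows up only on the first call, the cost is one-time startup overhead rather than anything cudaSetDevice() itself does.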

cudaSetDevice doesn’t set a context.

Try running nvidia-smi in a loop in the background before you run your app. (one of these days I’ll improve it, I just have to remember it when I have a modicum of free time…)

Thanks for your reply.

I time it like this:

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);
    cudaSetDevice(0);
    cutStopTimer(timer);
    printf("cudaSetDevice: %f ms\n", cutGetTimerValue(timer));

You mean that on the first CUDA call (my cudaSetDevice()), the host thread must set up a context, and most of my 1700 ms was spent on this context setup, not really on cudaSetDevice() itself?

OK, Tim, what should I do with nvidia-smi?

When I use nvidia-smi as follows (my GPUs are a Tesla S1070 plus a gtx8400 for display):

    nvidia-smi -g 0 -c 1 &
    nvidia-smi -g 1 -c 1 &
    nvidia-smi -g 2 -c 1 &
    nvidia-smi -g 3 -c 1 &
    nvidia-smi -g 4 -c 1 &

I get a CUDA runtime API error at the first cudaMalloc(): “all CUDA-capable devices are busy or unavailable.”

No, don’t do anything with exclusive mode or anything like that, just do nvidia-smi -l.

It isn’t anything to do with contexts (like Tim said, cudaSetDevice() doesn’t establish a context), but rather with the driver itself. The NVIDIA driver seems to like unloading internal modules and freeing resources “automagically” after a period of inactivity, and I think most of that time you are measuring is the driver re-loading and re-initialising everything. Running nvidia-smi in daemon mode keeps an API client attached and prevents the driver from unloading everything.
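
If you want to see the same effect without nvidia-smi, a rough keep-alive sketch (untested, just to illustrate the idea of keeping a client attached) would be something like this:

    /* Keep-alive sketch: hold a CUDA context open so the driver stays
       loaded between runs of the real application. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (cudaFree(0) != cudaSuccess) {   /* first call pays the init cost once */
            fprintf(stderr, "CUDA init failed\n");
            return 1;
        }
        printf("Holding a CUDA context open; Ctrl-C to quit.\n");
        for (;;)
            sleep(60);                      /* keep the process (and context) alive */
    }

Tim’s nvidia-smi -l is the simpler route, of course; the point is only that keeping any client attached stops the driver from tearing everything down.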

Interesting, I always thought the first-CUDA-call delay was a per-process setup overhead, not something system-wide.

Thanks as always Tim for teaching us the little details!

Thanks very much, Tim.

So now I know clearly where the 1700 ms comes from. Thanks, avidday.

Might be able to sneak an improvement into 3.2. We’ll see!