cudaSetDevice question

I’ve recently started using 2 GPUs in one system. Is it reasonable for cudaSetDevice to take 320 ms???
Is cudaSetDevice relevant only to the thread calling it? In other words, for the default device (device 0) could I skip the call and save that time,
while threads working with the second GPU and up would still call it?


The current device is set per thread.
You only need to set it once per thread.
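As a minimal sketch of the per-thread behaviour described above (names and structure are my own, not from the thread): each host thread calls cudaSetDevice once for its own GPU, and the setting applies only to that thread.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

// Each host thread binds itself to one device; cudaSetDevice is
// per-thread, so the two threads drive the two GPUs independently.
void worker(int dev)
{
    cudaSetDevice(dev);   // applies only to the calling thread
    // ... cudaMalloc / kernel launches for this GPU would go here ...
    printf("thread bound to device %d\n", dev);
}

int main()
{
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return 0;
}
```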

You only have to call it once, so I don’t think it’s that bad. It also creates a context (overhead you would otherwise see in the first kernel call or cudaMalloc, so there is no way around it, I believe).

Thanks a lot for the explanation :)

cudaSetDevice does not create a context.


Then 320 ms sounds like a lot.

It might be getting the device list and such for the first time. Not sure.

Ok, thanks… I’ll try to initialize this on startup and see what it gives.
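One common idiom for that kind of startup initialization (a sketch, not from the thread): issue a cheap call such as cudaFree(0) right after selecting the device, so any one-time context/initialization cost is paid up front rather than inside the first real kernel launch or cudaMalloc.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaSetDevice(1);   // select the second GPU for this thread
    cudaFree(0);        // harmless call that forces initialization now

    // ... later kernel launches and cudaMalloc calls should no longer
    // pay the one-time startup cost ...
    printf("device initialized\n");
    return 0;
}
```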



I have a system with more than one GPU (1 GTX 280 and two Teslas). I am using one of the SDK examples and am trying to set the device which is not being used. I obtained the following code from the website …/Choosing_a_GPU (if that link does not open, try the Google cache!):


int setdevice()
{
    int num_devices, device;

    cudaGetDeviceCount(&num_devices);

    if (num_devices > 1) {

        int max_multiprocessors = 0, max_device = 0;

        for (device = 0; device < num_devices; device++) {

            cudaDeviceProp properties;

            cudaGetDeviceProperties(&properties, device);

            if (max_multiprocessors < properties.multiProcessorCount) {

                max_multiprocessors = properties.multiProcessorCount;

                max_device = device;
            }
        }

        cudaSetDevice(max_device);

        return max_device;
    }

    return 0;
}



But it doesn’t work! Two codes which run end up choosing device 0. How do I fix the problem? I have followed the instructions given on the website, such as not calling the cudaInit(int argc, char **argv) function from my code. Is it because multiProcessorCount is the same for all the devices? How do I check which device is being used and which device is free?



Why do you say that? The code obviously is choosing the device with the maximum number of multiprocessors (something you could do with a simple cudaChooseDevice btw). From your description, it is working as designed.
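As a sketch of the cudaChooseDevice alternative mentioned above (assuming the standard CUDA runtime API; the property values are just illustrative): you fill in a cudaDeviceProp with the attributes you care about and let the runtime pick the closest match.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Describe the properties we care about; zeroed fields are ignored.
    cudaDeviceProp prop;
    memset(&prop, 0, sizeof(prop));
    prop.multiProcessorCount = 30;   // e.g. prefer a 30-SM part like the GTX 280

    int dev = 0;
    cudaChooseDevice(&dev, &prop);   // runtime selects the best-matching device
    cudaSetDevice(dev);
    printf("chose device %d\n", dev);
    return 0;
}
```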

You can’t. This is a feature we have been begging for for almost 2 years now. NVIDIA keeps saying, “we’re thinking about it”. If you want multiple jobs to run on separate GPUs, you need an external solution, e.g. lock files, IPC, or a job queuing system such as OpenPBS/Torque or Sun Grid Engine.

Thanks for the response…

This is why: I first run an SDK sample, which reports that device 0 is being used.

Then I run this modified (particles SDK) code with the setdevice function. Note that it returns the device number chosen, and it says zero as well!

I am a little confused. If that is the case, how would any code that is supposed to choose the free device work? In the code I posted above, does multiProcessorCount depend on whether that device is busy or not?

Thanks again in advance !

There’s no way to check whether a device is ‘free’ or not. If your application is the only user of the GPUs, you can implement such tracking at your level, along the lines of what MisterAnderson42 suggested above.

multiProcessorCount depends only on the device type, i.e. it will always be 30 for a GTX 280/Tesla C1060, no matter whether the device is busy or not.

Thanks! So the setdevice function is not really useful in my case. I guess I will have to manually pass a number to cudaSetDevice to choose a GPU, depending on what I have done for the other running codes.