No GPU selected, code working properly, how's this possible?

Dear all,

I’m a bit confused. I’m working on a project for which I have a system with two GPUs: a Quadro 600 for display purposes and a GTX Titan for computing.

I want to store data in constant memory (of the Titan) using cudaMemcpyToSymbol, and my code seems to work properly. That is, I can write data and read it back, the input data matches the output, and CUDA doesn’t return any errors.

Later I found out I forgot to select one of the GPUs (the Titan), so I’m very surprised the code worked at all. Where does the data go?

Any help is appreciated.



My understanding is that by default CUDA will query all CUDA-capable GPUs and then select the best of the bunch based on compute capability.

On my machine I have a 680 and a K20, and it just will use the K20 for calcs unless I tell it otherwise.

I suspected as much; I just wasn’t able to find any documentation on this.


Note that if you have several GPUs, you should not rely on the driver’s ordering to select the “best” card. On a system with 4 GPUs, I see this ordering:

Device 0: GTX 580
Device 1: GTX 275 (connected to display)
Device 2: GTX 680
Device 3: GTX 580

My guess is that the driver tries to ensure that CUDA device 0 is not the display device when there is more than one device available, but the ordering beyond that is arbitrary.

That does make sense, but if something like this happened:

Device 0: Titan (video out)
Device 1: GTX 460

would it by default use the GTX 460 for CUDA?

The above is not a sensible configuration, of course; I’m just wondering which device would be chosen by default when the difference in capability is large.

By default, CUDA kernels execute on device ID 0.
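To see this for the original question, here is a minimal sketch of a constant-memory round trip without any cudaSetDevice call. The array name `coeffs` and its contents are made up for illustration; the point is only that cudaGetDevice reports which device the runtime picked implicitly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant-memory array, used only for this illustration.
__constant__ float coeffs[4];

// Print a message and bail out if a runtime call failed.
static bool ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    // No cudaSetDevice() here: the runtime uses the current device,
    // which is device 0 unless the host thread selected another one.
    if (!ok(cudaMemcpyToSymbol(coeffs, in, sizeof(in)), "cudaMemcpyToSymbol")) return 1;
    if (!ok(cudaMemcpyFromSymbol(out, coeffs, sizeof(out)), "cudaMemcpyFromSymbol")) return 1;

    // Ask the runtime which device the copies above actually used.
    int dev = -1;
    cudaDeviceProp prop;
    if (!ok(cudaGetDevice(&dev), "cudaGetDevice")) return 1;
    if (!ok(cudaGetDeviceProperties(&prop, dev), "cudaGetDeviceProperties")) return 1;

    printf("Round trip %s; data lives on device %d (%s)\n",
           (out[0] == in[0] && out[3] == in[3]) ? "matched" : "MISMATCHED",
           dev, prop.name);
    return 0;
}
```

So the data does go somewhere real: into the constant memory of whichever card happens to be the current device.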

You can check which device has ID 0 on your system by using deviceQuery.
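If the SDK samples are not built, a few lines of runtime API code give the same basic listing. This is only a rough sketch that prints a small subset of what the SDK sample reports:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // List every device the runtime can see, in the runtime's own order.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```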

Concerning the rationale for ID assignment, here are two apparently contradictory documents on the web, namely


I’m not sure if things changed across different CUDA versions/drivers.

I don’t recall seeing the device order change after switching versions of CUDA (this particular computer has had CUDA 4 through 5.5 on it), but given the lack of specification, it is probably best to assume that it could change.

Thanks for all the feedback!

In my code I have a couple of lines to ensure my calculations are performed on the Titan. Basically, I query the device properties and select the device named “Titan”.
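Roughly, the selection looks like the sketch below. The helper name `findDeviceByName` is made up for this post, and matching on the substring "Titan" assumes that string appears in the name the driver reports for the card.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: returns the ID of the first device whose name
// contains `needle`, or -1 if no such device is found.
static int findDeviceByName(const char* needle) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        if (strstr(prop.name, needle) != NULL) return i;
    }
    return -1;
}

int main() {
    int dev = findDeviceByName("Titan");
    if (dev < 0) {
        fprintf(stderr, "No device with \"Titan\" in its name was found\n");
        return 1;
    }
    // Make the match the current device for this host thread.
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Using device %d\n", dev);
    return 0;
}
```

Note that this picks the first match, so it cannot tell two identical cards apart.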

Maybe I should look into using NVML.

What I find surprising is that there are no straightforward ways to select a device based on a unique serial number or something similar.

Also, CUDA deciding which device is fastest and making it device 0 seems a bit arbitrary. What happens if I have, for example, a system with multiple Titans? Which of them will then become device 0? And is the enumeration the same every time I start the system?


I think my example above shows that CUDA does not map the fastest card to device 0 in general. The GTX 680 is (for many, but not all, applications) a better card than the GTX 580, but it is device 2.

The device properties structure is pretty extensive, and should let you create a device selection heuristic appropriate for your application based on compute capability, memory size, number of multiprocessors, whether or not a display is connected, etc. I don’t think you’ll need to use NVML to pick a CUDA device.
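As an illustration, here is one possible heuristic. The ranking (compute capability first, then multiprocessor count, then global memory size) is an arbitrary assumption for this sketch, and the function name `better` is made up; a real application should rank by whatever actually limits it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed ranking for this sketch: higher compute capability wins, then
// more multiprocessors, then more global memory.
static bool better(const cudaDeviceProp& a, const cudaDeviceProp& b) {
    if (a.major != b.major) return a.major > b.major;
    if (a.minor != b.minor) return a.minor > b.minor;
    if (a.multiProcessorCount != b.multiProcessorCount)
        return a.multiProcessorCount > b.multiProcessorCount;
    return a.totalGlobalMem > b.totalGlobalMem;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }

    int best = -1;
    cudaDeviceProp bestProp = cudaDeviceProp();
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        if (best < 0 || better(prop, bestProp)) {
            best = i;
            bestProp = prop;
        }
    }
    if (best < 0 || cudaSetDevice(best) != cudaSuccess) {
        fprintf(stderr, "Could not select a device\n");
        return 1;
    }
    printf("Selected device %d: %s (%d multiprocessors)\n",
           best, bestProp.name, bestProp.multiProcessorCount);
    return 0;
}
```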

There does not appear to be a unique card serial number in the device property structure, but it does have fields for the PCI Express “coordinates” (domain, bus, device) of the device, which should be stable as long as the card is not moved to a different slot in the computer.
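A sketch of how the PCI coordinates could be used: run without arguments, it prints the bus ID string of each device; run with one argument, it selects the device at that address. Taking the address from the command line is just a choice made for this example, and the string is expected in the form the runtime itself prints.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    if (argc > 1) {
        // Look up the CUDA device ID for a given PCI bus ID string.
        int dev = -1;
        cudaError_t err = cudaDeviceGetByPCIBusId(&dev, argv[1]);
        if (err != cudaSuccess) {
            fprintf(stderr, "No CUDA device at %s: %s\n",
                    argv[1], cudaGetErrorString(err));
            return 1;
        }
        if (cudaSetDevice(dev) != cudaSuccess) return 1;
        printf("PCI address %s is CUDA device %d\n", argv[1], dev);
        return 0;
    }

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
    for (int i = 0; i < count; ++i) {
        char busId[32];
        cudaDeviceProp prop;
        if (cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), i) != cudaSuccess) continue;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        // The same coordinates are also available as integer fields.
        printf("Device %d: %s at %s (domain %d, bus %d, device %d)\n",
               i, prop.name, busId,
               prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

This distinguishes identical cards, but only as long as they stay in the same slots.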

Hi Seibert,

You’re right about using the device properties to select the appropriate device. At the moment, selection based on the device name suffices, but in the future I may have to resort to using the PCI bus ID etc. as well.

I’m curious, what parameter in the device properties structure indicates whether a display is connected to the device or not?
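As far as I know there is no field that says "display attached" directly. The closest one I am aware of is kernelExecTimeoutEnabled, which reports whether kernels on that device are subject to a run-time limit (the watchdog that is typically active on a GPU driving a display). Treat it as a proxy, not a guarantee. The sketch below reads the same information through cudaDeviceGetAttribute with cudaDevAttrKernelExecTimeout:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        fprintf(stderr, "Could not query the device count\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        // Nonzero means kernels on this device have a run-time limit,
        // which usually (not always) indicates a display is attached.
        int timeout = 0;
        cudaError_t err =
            cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, i);
        if (err != cudaSuccess) {
            fprintf(stderr, "Device %d: %s\n", i, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: kernel run-time limit %s\n",
               i, timeout ? "enabled (possibly a display GPU)" : "disabled");
    }
    return 0;
}
```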


I agree with seibert that it is more probable that the device IDs are assigned according to the “physical location” of the device rather than according to performance heuristics, contrary to the answer at

Indeed, what does “best performance” mean? Throughput? Memory?

Concerning selecting the “best” device, at

there is a code snippet that selects the card with the largest number of multiprocessors; some CUDA SDK multi-GPU examples (p2p) also contain code that makes such a selection.

Today I set up a PC with a Tesla C2050 card for computation and an old 8400 GS card for visualization, and tried swapping their positions between the first two PCI-E slots. Using deviceQuery, I noticed that GPU 0 is always the one in the first PCI-E slot and GPU 1 is always the one in the second. I do not know if this is a general rule, but it shows that, at least on my system, GPUs are numbered according to their positions, not their “power”.