How to query device #s of available GPU devices?

Hi all,

How can I query the device numbers of the available CUDA GPUs in a Linux environment? Knowing the total number of devices (cudaGetDeviceCount) is not enough; I need the CUDA API to tell me which device numbers are actually available.

We need this in our multi-user CPU-GPU cluster, which has several CPU compute hosts with 2 GPUs each. We have to accommodate hybrid MPI+CUDA codes, where an MPI process on a compute host has to know which GPU device to use for CUDA computing, since the other device may be taken by another user.

As a follow-up: has anyone done SGE + CUDA queue and PE configuration beyond a simple all.q? We would like to allocate SGE resources based on available GPUs, and pin/reserve a GPU device to an SGE job instance.


Once you have the count of devices, you can call cuDeviceGet() (if you’re using the driver API; check the reference for the runtime equivalent) to get a handle to a specific device in the range [0, X-1], where X is the count returned by cuDeviceGetCount(). Once you have the device handle, you can call cuDeviceGetName() with it to get the name of the device, or cuDeviceGetProperties() to get its other properties. You can do those last few steps in a loop after you get the device count if you want the information for all of the devices in the system.
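A minimal driver-API sketch of that loop (compile with -lcuda; needs a CUDA-capable driver, so treat it as illustrative rather than tested):

```c
/* Sketch: enumerate all CUDA devices with the driver API. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    char name[256];
    int count, i;

    cuInit(0);
    cuDeviceGetCount(&count);          /* total number of devices */

    for (i = 0; i < count; ++i) {
        cuDeviceGet(&dev, i);          /* handle for ordinal i */
        cuDeviceGetName(name, sizeof(name), dev);
        printf("device %d: %s\n", i, name);
        /* cuDeviceGetProperties() / cuDeviceTotalMem() etc. go here */
    }
    return 0;
}
```

Note this only tells you what devices exist, not which ones are free — which is the original question.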

Just wait for 2.2 final…

Does that mean Seppo nailed it? The thing that is better than any of the other 2.2 improvements but won’t appear until 2.2 final?

Yeah. In 2.2 under Linux, you can use nvidia-smi to designate a GPU as supporting multiple contexts, a single context, or no contexts. You can query this in CUDART, plus we give you some convenience features to make this easy. So, you have multiple GPUs and multiple MPI processes that need GPUs–no problem. Set all your GPUs to single context mode (aka compute exclusive mode), don’t call cudaSetDevice() (or you call the new function to set the valid list of possible devices), and run your app. One process will grab the first GPU, the other will try the first GPU and fail because the context already exists, then (assuming you use CUDART) it will silently retry and create a context on the next GPU. Once you’re out of GPUs, context creation will fail.
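A sketch of the runtime-side pattern described above, assuming two candidate devices; cudaSetValidDevices() is the new 2.2 call that restricts which devices the runtime may try (requires exclusive-mode GPUs, so again illustrative only):

```c
/* Sketch: let CUDART bind to the first free exclusive-mode GPU. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int candidates[] = {0, 1};   /* host-specific candidate list */
    float *d_buf;
    int dev;

    /* Optional: restrict the runtime to this ordered list. */
    cudaSetValidDevices(candidates, 2);

    /* Do NOT call cudaSetDevice(). The first call that needs a context
       tries each valid device in turn; if the exclusive-mode context
       already exists on one, CUDART silently retries the next. */
    if (cudaMalloc((void **)&d_buf, 1024) != cudaSuccess) {
        fprintf(stderr, "no free GPU available\n");
        return 1;
    }
    cudaGetDevice(&dev);
    printf("bound to device %d\n", dev);
    cudaFree(d_buf);
    return 0;
}
```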

In other words, all of the problems that Seppo mentioned just go poof and disappear.

That sounds brilliant! Just what the doctor ordered. Pity I’m currently required to use CUDA in windows, but I’m sure I’ll go back to linux at the first opportunity.

Cool! We’ve asked and asked and you guys have finally delivered :) And in a way that will make cluster admins very happy. I’m looking forward to tearing down all the ad-hoc and poorly debugged scripts that implemented this functionality client-side.

Great !!! :)

Btw, regarding what you say above: does this also apply to a single process with multiple threads, each using a different GPU? And of course, will it also work on a GTX 295?



Yes and yes. It doesn’t actually matter whether the two contexts are created from the same thread and just pushed/popped with the context migration APIs, from different threads in the same process, or from different processes; it’s simply a restriction on the number of contexts that can exist on a GPU at a time.
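For completeness, a driver-API sketch of the single-thread push/pop pattern mentioned above (assumes devices 0 and 1 exist and are both free; error checking omitted):

```c
/* Sketch: two contexts owned by one thread, swapped via the
   context migration calls. */
#include <cuda.h>

int main(void)
{
    CUcontext ctxA, ctxB;
    CUdevice devA, devB;

    cuInit(0);
    cuDeviceGet(&devA, 0);
    cuDeviceGet(&devB, 1);

    cuCtxCreate(&ctxA, 0, devA);  /* ctxA becomes current on this thread */
    cuCtxPopCurrent(NULL);        /* detach it so a second create is legal */
    cuCtxCreate(&ctxB, 0, devB);
    cuCtxPopCurrent(NULL);

    /* To work on GPU 0: push, do driver-API work, pop. Same for ctxB. */
    cuCtxPushCurrent(ctxA);
    /* ... launches on device 0 ... */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctxA);
    cuCtxDestroy(ctxB);
    return 0;
}
```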

Is similar functionality available in the driver API? I.e. if a GPU is set to a single context, will cuCtxCreate() fail on that device? I suppose I can just try the next device until I either run out or find one?



cuCtxCreate will fail. There’s no syntactic sugar in the driver API to handle multiple devices, but it behaves exactly as you would expect.
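So with the driver API you do the retry yourself — a sketch of the "try the next device until one succeeds" loop suggested above:

```c
/* Sketch: no automatic retry in the driver API, so walk the device
   list and take the first GPU whose context creation succeeds. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUcontext ctx = NULL;
    CUdevice dev;
    int count, i;

    cuInit(0);
    cuDeviceGetCount(&count);

    for (i = 0; i < count; ++i) {
        cuDeviceGet(&dev, i);
        if (cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS) {
            printf("got device %d\n", i);
            break;             /* this exclusive-mode GPU was free */
        }
        /* creation failed: device busy or prohibited, try the next */
    }
    if (ctx == NULL) {
        fprintf(stderr, "all GPUs are in use\n");
        return 1;
    }
    /* ... driver-API work ... */
    cuCtxDestroy(ctx);
    return 0;
}
```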

Since CUDA 1.1, we’ve been rolling our own solution too, but it’s had some other advantages. We created a wrapper library that only exposes the GPUs allocated to that user by a scheduler/batch system, but it is probably best described just by posting a relevant section from the user readme below.

CUDA Wrapper USER readme

The CUDA wrapper library is typically implemented as a forced preload, such
that the device allocation calls to CUDA are intercepted by it for a few
different benefits. Only users requesting multiple GPUs per node really need
to be aware of its transparent operation. The wrapper library accomplishes
three things:

  1. Virtualizes the physical GPU devices to a dynamic mapping that is always
     zero-indexed. The virtual devices visible to the user map to a consistent
     set of physical devices, which accomplishes “user fencing” on shared
     systems and prevents users from accidentally trampling one another.

  2. Rotates the virtual-to-physical mapping for each new process that requests
     a GPU resource. This provides a method for large parallel tasks to use
     common startup parameters and still use multiple device targets, i.e.
     each time a new process asks for gpu0, the underlying physical device is
     shifted, or rotated if you will, so that the next process asking for
     gpu0 gets the next allocated physical device. Please note that rotation
     does not occur for new threads within a single process, only for new
     processes. CAUTION: users accustomed to targeting gpu0, gpu1, etc. with
     different processes on systems without this wrapper must understand this
     feature to avoid trampling their own processes. E.g. if you have two
     GPU devices allocated and you launch two processes, one targeted at gpu0
     and the other at gpu1, both processes will end up using the same GPU
     device! Target them both at gpu0 unless they are different threads
     within a single process.

  3. NUMA affinity, if relevant, can be mapped between CPU cores and GPU
     devices. This has been shown to give as much as a 25% improvement in
     host-to-device memory bandwidth. This feature is transparent.

There is a link to download it on this page (search for CUDA Wrapper Library):

Also included in it is a memory scrubber utility, which we run between user jobs so that userB can’t read out whatever userA left in the GPU memory.

Jeremy Enos


Yeah, Jeremy’s library is a big part of why we’ve gone with exclusive mode as opposed to something else–it seems to work well and people like it. Exclusive mode gets you #1 and #2 easily enough. #3 is coming in a future driver release.


Until 2.2, you can look at a good solution that we have tested with SGE. With it, all you need is a termination or clean-up script for the case where a process has aborted.

Best regards,

Guillermo Andrade

Actually, exclusive mode is available right now (before the 2.2 final toolkit comes out) using nvidia-smi and driver 185.18.04.
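For reference, setting compute-exclusive mode is done per GPU with nvidia-smi, typically as root. The flag syntax has changed across driver generations, so check `nvidia-smi -h` on your system; the lines below are a sketch of both the old numeric form and the current named form:

```shell
# Old-style drivers (185-era): -g selects the GPU, -c sets the
# compute mode rule (0 = default, 1 = exclusive, 2 = prohibited).
nvidia-smi -g 0 -c 1

# Current drivers use -i for the GPU index and named modes:
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Verify the setting:
nvidia-smi -q | grep -i "compute mode"
```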