How can I query the available CUDA GPU device numbers in a Linux environment? Knowing the total number of devices (cudaGetDeviceCount) is not enough; I need the CUDA API to tell me which device numbers are actually available.
We need this for our multi-user CPU-GPU cluster with several CPU compute hosts, each with 2 GPUs. We need to accommodate hybrid MPI+CUDA codes, where the MPI process on a compute host has to know which GPU device to use for CUDA computing, since the other device could be taken by another user.
As a follow-up to the above, has anyone done SGE + CUDA queue and PE configuration beyond a simple all.q? We would want to allocate SGE resources based on available GPUs, and pin/reserve a GPU device to an SGE job instance.
Once you have the count of devices, you can call cuDeviceGet() (if you’re using the driver API; check the reference for the equivalent runtime call) to get a handle to a specific device in the range [0, X-1], where X is the number returned by cuDeviceGetCount(). Once you have the device handle, you can call cuDeviceGetName() with it to get the name of the device, or cuDeviceGetProperties() to get its other properties. You can do those last few steps in a loop after you get the device count if you want the information for all of the devices in the system.
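To make that concrete, here is a minimal sketch of that enumeration loop with the driver API; the specific per-device queries (cuDeviceGetName(), cuDeviceTotalMem()) are just examples, and error handling is reduced to a single check. Link against -lcuda.

```c
// Minimal sketch: enumerate all CUDA devices with the driver API.
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed\n");
        return 1;
    }

    int count = 0;
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);              /* device handle for ordinal i */

        char name[256];
        cuDeviceGetName(name, sizeof(name), dev);

        size_t totalMem = 0;
        cuDeviceTotalMem(&totalMem, dev);  /* total global memory in bytes */

        printf("Device %d: %s, %zu MB\n", i, name, totalMem / (1024 * 1024));
    }
    return 0;
}
```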
Yeah. In 2.2 under Linux, you can use nvidia-smi to designate a GPU as supporting multiple contexts, a single context, or no contexts. You can query this in CUDART, plus we give you some convenience features to make this easy. So you have multiple GPUs and multiple MPI processes that need GPUs? No problem: set all your GPUs to single-context mode (aka compute-exclusive mode), don’t call cudaSetDevice() (or call the new function that sets the list of valid devices, cudaSetValidDevices()), and run your app. One process will grab the first GPU; the other will try the first GPU and fail because a context already exists there, then (assuming you use CUDART) it will silently retry and create a context on the next GPU. Once you’re out of GPUs, context creation will fail.
In other words, all of the problems that Seppo mentioned just go poof and disappear.
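A minimal sketch of the pattern described above, assuming every GPU on the node has been put into compute-exclusive mode with nvidia-smi; the cudaFree(0) call is just a convenient way to force the runtime to create a context, and the MPI scaffolding is only there to show the per-rank usage:

```c
// Sketch: each MPI rank lets CUDART pick a free GPU (no cudaSetDevice call),
// relying on compute-exclusive mode to fall through to the next free device.
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Any runtime call that creates a context will do; cudaFree(0) is a
       common trigger. If every GPU already has a context, this fails. */
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: no free GPU (%s)\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int dev = -1;
    cudaGetDevice(&dev);   /* which physical GPU this rank ended up on */
    printf("rank %d is using GPU %d\n", rank, dev);

    MPI_Finalize();
    return 0;
}
```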
That sounds brilliant! Just what the doctor ordered. Pity I’m currently required to use CUDA on Windows, but I’m sure I’ll go back to Linux at the first opportunity.
Cool! We’ve asked and asked and you guys have finally delivered :) And in a way that will make cluster admins very happy. I’m looking forward to tearing down all the ad-hoc and poorly debugged scripts that implemented this functionality client-side.
Yes and yes. It doesn’t actually matter whether the two contexts are created from the same thread and just pushed/popped with the context migration APIs, by different threads in the same process, or by different processes; it’s just a restriction on the number of contexts that can exist on a GPU at a time.
Is similar functionality available in the driver API? I.e. if a GPU is set to a single context, will cuCtxCreate() fail on that device? I suppose I can just try the next device until I either run out or find one?
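For reference, a minimal sketch of that probing loop with the driver API, assuming (as the runtime behaviour described above suggests) that cuCtxCreate() returns an error on an exclusive-mode device that already has a context:

```c
// Sketch: walk the device list and keep the first device where context
// creation succeeds; give up if every device is busy.
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    CUcontext ctx = NULL;
    int chosen = -1;

    for (int i = 0; i < count && chosen < 0; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        /* On a compute-exclusive GPU that already has a context,
           cuCtxCreate() returns an error instead of CUDA_SUCCESS. */
        if (cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS)
            chosen = i;
    }

    if (chosen < 0) {
        fprintf(stderr, "no free GPU found\n");
        return 1;
    }
    printf("created context on device %d\n", chosen);

    cuCtxDestroy(ctx);
    return 0;
}
```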
Since CUDA 1.1, we’ve been rolling our own solution too, and it has some other advantages. We created a wrapper library that exposes only the GPUs allocated to a given user by the scheduler/batch system; it is probably best described by quoting the relevant section from the user readme below.
CUDA Wrapper USER readme
Overview:
The CUDA wrapper library is typically implemented as a forced preload, such
that the device allocation calls to CUDA are intercepted by it for a few
different benefits. Only users requesting multiple GPUs per node really need
to be aware of its transparent operation. The wrapper library accomplishes
three things:
1. Virtualizes the physical GPU devices behind a dynamic mapping that is
   always zero-indexed. The virtual devices visible to the user map to a
   consistent set of physical devices, which accomplishes “user fencing” on
   shared systems and prevents users from accidentally trampling one another.
2. Rotates the virtual-to-physical mapping for each new process that
   requests a GPU resource. This lets large parallel tasks use common
   startup parameters and still target multiple devices: each time a new
   process asks for gpu0, the underlying physical device is shifted
   (rotated, if you will), so the next process asking for gpu0 gets the next
   allocated physical device. Please note that rotation does not occur for
   new threads within a single process, only for new processes. CAUTION:
   users accustomed to targeting gpu0, gpu1, etc. with different processes
   on systems without this wrapper must understand this feature to avoid
   trampling their own processes. For example, if you have two GPU devices
   allocated and you launch two processes, one targeted at gpu0 and the
   other at gpu1, both processes will end up on the same physical GPU! Run
   them each against gpu0 unless they are different threads within a single
   process. (See the sketch after this readme excerpt.)
3. NUMA affinity, if relevant, can be mapped between CPU cores and GPU
   devices. This has been shown to give as much as a 25% improvement in
   host-to-device memory bandwidth. This feature is transparent.
There is a link to download it on this page (search for CUDA Wrapper Library):
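To illustrate the rotation behaviour in item 2, here is a hypothetical usage sketch. It assumes the wrapper library is force-preloaded by the batch system, so every MPI rank simply targets virtual device 0 and the wrapper rotates the underlying physical GPU per process:

```c
// Hypothetical usage sketch under the CUDA wrapper library described above.
// Every rank targets virtual device 0; the preloaded wrapper is assumed to
// rotate the physical mapping per process, so ranks land on different GPUs.
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);   /* always "gpu0"; the wrapper maps it to a real device */

    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("rank %d: virtual device 0 -> %s\n", rank, prop.name);

    MPI_Finalize();
    return 0;
}
```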
Yeah, Jeremy’s library is a big part of why we’ve gone with exclusive mode as opposed to something else; it seems to work well and people like it. Exclusive mode gets you #1 and #2 easily enough. #3 is coming in a future driver release.
This combination gives a good solution that we have tested with SGE. With it, all you need is a termination or clean-up script for the case where a process has aborted.