CUDA API request: cudaSetDeviceLeastUsed()

We’ve recently moved our second 8800 GTX into the same server as our first GTX card, and we need a way to ensure concurrent CUDA jobs don’t use the same card.

I’d like to request a new Device Management function to make this easier:

cudaError_t cudaSetDeviceLeastUsed(int *dev);

Binds the calling host thread to the CUDA device currently being used by the fewest host processes.  The selected device number is returned in *dev.

This function is atomic, so if there are N available devices and N processes call cudaSetDeviceLeastUsed() at the same time, they are all guaranteed to be assigned to different devices.
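Lacking such a call, the selection half of the logic is easy to sketch in user space. This is only an illustration of the intended behavior, not a proposed implementation: the per-device process counts and the `pick_least_used` helper are invented here, and a real cudaSetDeviceLeastUsed() would have to obtain the counts and perform the assignment atomically inside the driver to give the N-processes/N-devices guarantee.

```c
#include <limits.h>

/* Hypothetical sketch: given a per-device count of attached host
 * processes (however the driver would obtain that), return the index
 * of the least-used device.  The atomicity described above would
 * require this lookup and the subsequent device binding to happen
 * under a driver-level lock shared by all processes. */
static int pick_least_used(const int *proc_counts, int ndevices)
{
    int best = -1, best_count = INT_MAX;
    for (int i = 0; i < ndevices; i++) {
        if (proc_counts[i] < best_count) {
            best_count = proc_counts[i];
            best = i;
        }
    }
    return best;
}
```

Ties go to the lowest-numbered device; any deterministic tie-break would do, since the atomicity requirement is what actually spreads N simultaneous callers across N devices.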

For the time being, we are adding this logic into the program we run most frequently, but a global way to do this right for all CUDA jobs would be nice.


Over the last few months I’ve requested (in the NVIDIA bug/RFE system) a number of CUDA features along these lines to assist with cluster-based CUDA runs. As you can imagine, things get interesting when multiple independent CUDA jobs get scheduled onto the same node, and they all want to use all of the CUDA devices :-)

In my various feature requests, I have asked for a few alternative mechanisms:

  1. an exclusive-open call, such that the device will not show up as available to other jobs

  2. calls to determine how “busy” or active a GPU is

  3. calls to reserve GPU resources, e.g. global memory, so that when two jobs do share a device, they can reserve the amount of memory they need up front, and prevent other jobs from interfering with their operation, once started.

  4. methods for setting limits akin to what one does in Unix for the CPU/memory/stack/etc, but for the GPU. Ideally something that can interact with a queueing system, for example.
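To make item 3 concrete, the reservation semantics might look like the sketch below. Everything here is hypothetical: nothing like `gpu_ledger` or `gpu_reserve` exists in the CUDA API, and a real implementation would have to live in the driver and be shared across processes rather than sitting in one process's address space.

```c
#include <stddef.h>

/* Hypothetical per-device reservation ledger.  The point is the
 * semantics: a job reserves the global memory it needs up front,
 * and a later reservation that does not fit fails immediately,
 * instead of a running job failing mid-computation because another
 * job quietly consumed the device's memory. */
typedef struct {
    size_t total;     /* total global memory on the device   */
    size_t reserved;  /* sum of outstanding reservations     */
} gpu_ledger;

/* Returns 1 and records the reservation if 'bytes' fit, else 0. */
static int gpu_reserve(gpu_ledger *g, size_t bytes)
{
    if (g->total - g->reserved < bytes)
        return 0;
    g->reserved += bytes;
    return 1;
}
```

With this model, two jobs sharing an 8800 GTX (768 MB) could each claim half the card at startup and then run without fear of the other's allocations failing theirs partway through.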

These were just my crude suggestions; I don’t know how practical any of them are, so we’ll see what the NVIDIA engineers come up with. At the very least they now know that others besides those of us at UIUC also want these things :-)


John Stone

Ah, those are even better suggestions. The exclusive open call would exactly solve our problem.

I have not experimented much with multiple user processes running on the same CUDA device. Early on with CUDA 0.8 I tested this briefly and found that two processes using the same device ran much, much slower than you would expect (this was a dual-core system with a single 8800 GTX). I haven’t tried again since that time, though.

We hadn’t intended to be experimenting with multiple users sharing cards, but we quickly ran into this once a larger number of people began using the GPU clusters here at UIUC and UNC Chapel Hill… :-)

Without some sort of exclusive access API, the only way to avoid this problem is to configure the queueing system so that users can only allocate entire nodes at a time. On the UIUC cluster that’s not really an option though, as the same nodes contain both GPUs and FPGAs, so there are different people running jobs on the different accelerator devices, making things slightly more wacky…
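Until an exclusive-open call exists in the driver, one stopgap we could imagine is an advisory per-device lock file, as sketched below. To be clear about its limits: the /tmp lock-file path is an invented convention, the `try_exclusive_open` helper is not part of any API, and this only coordinates processes that voluntarily use it — it cannot stop an arbitrary CUDA job from grabbing the device anyway.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Sketch of cooperative exclusive device access using an advisory
 * flock() on a per-device lock file.  On success, returns an open
 * file descriptor that must be kept open for as long as the job
 * holds the device; closing it (or process exit) releases the lock.
 * Returns -1 if another cooperating job already holds the device. */
static int try_exclusive_open(int dev)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/cuda-dev%d.lock", dev);
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);          /* device already claimed by another job */
        return -1;
    }
    return fd;
}
```

A job prologue in the queueing system could walk device numbers calling `try_exclusive_open()` until one succeeds, which approximates the exclusive-open behavior requested above for well-behaved jobs.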