Multiple GPUs, multiple applications

I am wondering how kernels are handed to GPUs in a system where multiple applications use CUDA and none of them explicitly sets a device. Will all the applications create a context attached to the same “default” GPU, or is the driver smart enough to create contexts on different GPUs, for instance ones that are not being used by other applications?

The reason I ask is that I want multiple applications to offload data with CUDA concurrently, and I do not want to hard-code in each application which GPU it uses; I would prefer a more dynamic approach.

Any ideas on how this is done? My suspicion is that the current multi-GPU approach in CUDA is to create multiple contexts within a single application, one per GPU, rather than to have multiple applications spread across multiple GPUs.

In CUDA 2.1 and earlier:
Each application (or each thread within a single application, it doesn’t matter) needs to call cudaSetDevice() to choose the device it will run on. As for automating the choice, several solutions have been posted to the forums, such as forced-preload libraries that override cudaSetDevice(). Most of these have been targeted at solving the situation where jobs are run from a batch queue (PBS, SGE) and GPUs need to be scheduled.

If no cudaSetDevice() call is made, all contexts default to GPU 0.
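
A minimal sketch of that pre-2.2 pattern (the GPU_ID environment variable here is something I made up for illustration, not anything CUDA defines):

    /* Pre-2.2 style device selection: each process (or host thread) picks
     * its device explicitly before the first call that creates a context.
     * Without cudaSetDevice(), everything lands on device 0.             */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int dev = 0;
        const char *env = getenv("GPU_ID");   /* made-up variable, set by your job script */
        if (env) dev = atoi(env);

        if (cudaSetDevice(dev) != cudaSuccess) {
            fprintf(stderr, "could not select device %d\n", dev);
            return 1;
        }
        cudaFree(0);                          /* forces context creation on the chosen device */
        printf("running on device %d\n", dev);
        /* ... launch kernels as usual ... */
        return 0;
    }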

The situation is not optimal. Thankfully, NVIDIA has fixed it :)

In CUDA 2.2 and later (which isn’t out quite yet):
see http://forums.nvidia.com/index.php?showtopic=93484&pid=524949&st=0&#entry524949

Hi MisterAnderson,

I was aware of the 2.2 feature, but seeing your post a thought came to me. How does NVIDIA ensure there is no host/CPU application starvation?

What I mean is: if the kernels are fast enough, couldn’t one CPU thread/application constantly fail to get a valid context?

thanks

eyal

I don’t know what they actually do, so I’m only speculating.

Jim Phillips did some tests and found that CUDA 2.1 and earlier do a completely “fair” round-robin scheduling between all contexts active on a GPU.

What CUDA 2.2 is adding is not at the kernel call level, but at the context creation one. I.e. at the beginning of your program, you find a free GPU that doesn’t have any contexts on it and select that GPU to run billions of kernels on. The next application that comes along will be able to choose another GPU in the system and run on that one. The whole point is to avoid getting more than one application running on one GPU.
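
A rough sketch of what that looks like from the application side, written against the driver API and assuming the GPUs are in exclusive mode so that cuCtxCreate() simply fails on a busy device (this is my illustration, not what CUDART 2.2 actually does internally):

    /* Walk the devices and keep the first one on which context creation
     * succeeds. With exclusive mode set, cuCtxCreate() fails on any GPU
     * that already has a context, so success means "this GPU was free". */
    #include <cuda.h>
    #include <stdio.h>

    static CUcontext grab_free_gpu(void)
    {
        int count = 0;
        cuInit(0);
        cuDeviceGetCount(&count);

        for (int i = 0; i < count; ++i) {
            CUdevice dev;
            CUcontext ctx;
            cuDeviceGet(&dev, i);
            if (cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS) {
                printf("got device %d to ourselves\n", i);
                return ctx;            /* keep this context for the life of the app */
            }
        }
        return NULL;                   /* every GPU is already in use */
    }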

The only tricky issue is how to handle the race condition of two applications requesting at the same time. I’m curious to see how the CUDA API solves that issue.

I guess it would be more problematic if the host thread/host application keeps starting up and then shutting down. Every time the thread/app starts, it will request a “new” context.

If I just open X threads that each ask for a context upon creation and then run forever, that would be an easier case to handle.

Still, there is the matter of making sure (even in the simple case) that no thread will fail to get a device.

eyal

There’s not really any required API behavior to solve that issue. We’re careful about what exclusive mode actually does: it doesn’t check for a free device and then create a context, or anything like that; it creates a context only if there is no context already in existence. There are no race conditions that could lead to invalid behavior; that is, no device can possibly end up with two contexts.

Is it possible to get starvation? Sure. Write a utility function that creates a context via cudaFree(0) and, if that fails, microsleeps and tries again.
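
A sketch of that retry utility (the 100 ms back-off is arbitrary):

    /* Keep trying to create a context until one of the (exclusive-mode)
     * GPUs frees up. cudaFree(0) is just a cheap way to force context
     * creation.                                                        */
    #include <cuda_runtime.h>
    #include <unistd.h>

    static void acquire_context_with_retry(void)
    {
        while (cudaFree(0) != cudaSuccess) {
            cudaGetLastError();        /* clear the error before the next attempt */
            usleep(100 * 1000);        /* back off for ~100 ms */
        }
        /* cudaSuccess: this thread now owns a context on some free GPU */
    }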

Hmm, well, I misread your original description. I read it as meaning that the caller can request a list of available devices. Now that I reread it, I see that the caller instead specifies a list of devices it wants to run on, and CUDART finds the first free GPU it can within that list.
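
If I have it right this time, the runtime call being described is cudaSetValidDevices(); roughly like this (my sketch, assuming exclusive mode is set on the listed GPUs):

    /* Tell CUDART which devices this process is willing to use; the next
     * call that needs a context then creates one on the first free GPU
     * from the list.                                                    */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int allowed[] = { 0, 1, 2, 3 };        /* any of these four is acceptable */
        cudaSetValidDevices(allowed, 4);

        cudaFree(0);                           /* trigger context creation */

        int dev = -1;
        cudaGetDevice(&dev);                   /* which GPU did we actually get? */
        printf("context created on device %d\n", dev);
        return 0;
    }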

Sorry for spreading confusion!

The way it works is not a problem for me, since 99.999% of the time I will be running on machines with exclusive mode set. But there are potential use cases (e.g., running benchmarks on my development box) where being able to use the multiple-context mode while still choosing a free device could be useful. Although, now that I think about it, I could just set one of the two GPUs in the development box to exclusive for benchmarks and leave the other non-exclusive for general testing and debugging.

In that case, please don’t tell me that using nvidia-smi to set the exclusive mode is a Tesla-only feature…

Exclusive mode is not a Tesla-only feature, because of people like you. :)

We’ve considered adding a second mode (conditional context creation if there’s no context in existence when cuCtxCreate calls the kernel module) for cases where turning on exclusive mode isn’t a good thing, but we’re not sure it’s useful yet. If you find that exclusive mode isn’t quite flexible enough for some reason, let me know and we can try to get that in.

I will soon have a use case where I want to open a context on the device with fewest contexts already open. This is for a program where I have several processes using a CUDA device intermittently, so sharing devices is not a problem. A superset of the new functionality would cover this case as well.

man, don’t have more than one context on a device, it just causes hassle :( (I guess if you have several processes it can’t be avoided, but blah all the same.)

I’ve proposed something that will let you implement this kind of functionality yourself (it’s not as clean as you’d like, but I am also trying to make it very clear through the API that fewest number of contexts != most free GPU except when you guarantee that yourself with a lot of a priori knowledge), but I don’t know when you’ll get it.

also: argghhhh multiple contexts on a GPU, foaming at the mouth

I tested it fully realizing the gun was loaded and aimed at my foot, but it works way better than I anticipated for my particular usage. This software is used in very limited cases (not making a commercial or general use product here, just looking for dark matter), so I don’t expect to drive any CUDA updates.

Multiple contexts on one GPU is a bad, bad idea for a “CUDA application,” but not necessarily bad for “an application that happens to use CUDA.” :)