How can I tell if a GPU is busy?

My application is written in MATLAB and uses parfor parallelization, which means each worker runs in a separate process, and each process has its own CUDA context. Because I have more processes than GPUs, my function often has to wait for a GPU when it wants to do some work.

For better overall throughput, I want to perform my time-consuming computation on the CPU when the GPU is busy working on behalf of another process. In my MEX function, I want to implement logic like the following:

if (gpuIsBusy())
    computeOnCpu();
else
    computeOnGpu();

But I don’t know how to write the gpuIsBusy function. If everything were in a single process this might be easy: I could use a critical section. To communicate between processes, I suppose I could use a named semaphore (sketched below), but I’m hoping there’s a CUDA call that makes it easier.
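In case it helps, here is roughly what that named-semaphore fallback could look like on Windows (where this runs, per the later posts). The semaphore name, the GPU count, and the release helper are all made up for illustration; this only counts how many processes currently hold a GPU slot, and it doesn't pick which GPU the winner should use.

#include <windows.h>

static const LONG kNumGpus = 2;  // assumed GPU count; one semaphore slot per GPU

// Try to claim a GPU slot without blocking. Returns true ("busy") if no slot
// is free; on false, *slot holds the semaphore and must be released later.
static bool gpuIsBusy(HANDLE* slot)
{
    // Every worker process opens (or creates) the same named semaphore; the
    // initial count only matters for whichever process creates it first.
    *slot = CreateSemaphoreA(NULL, kNumGpus, kNumGpus, "Global\\matlab_gpu_slots");
    if (*slot == NULL)
        return true;  // couldn't even open the semaphore; be conservative, use CPU
    if (WaitForSingleObject(*slot, 0) == WAIT_OBJECT_0)  // zero timeout: just poll
        return false;  // slot acquired; this process may use a GPU
    CloseHandle(*slot);
    *slot = NULL;
    return true;
}

static void releaseGpuSlot(HANDLE slot)
{
    ReleaseSemaphore(slot, 1, NULL);  // hand the slot back to other processes
    CloseHandle(slot);
}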

nvidia-smi reports GPU utilization as a percentage (the duty cycle of kernels running), but I’m not aware of that number being available via an API call. Even if it were, you’d likely run into race conditions checking it.

Another option is to use compute-exclusive mode, which allows only one process (or thread) to access a GPU at a time. The GPU to use is selected automatically, and you can tell that all of them are busy when the first cuda* call that initializes the context (e.g. cudaFree(0)) returns an error code.
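To make that concrete, a gpuIsBusy() along those lines might look like the sketch below, assuming the GPUs are already in compute-exclusive mode. gpuIsBusy/computeOnCpu/computeOnGpu are just the names from the question, not a real API; treat this as an illustration, not a tested implementation.

#include <cuda_runtime.h>

static bool gpuIsBusy(void)
{
    // In compute-exclusive mode the driver auto-selects a free GPU; cudaFree(0)
    // just forces the normally deferred context creation to happen now.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        // Typically cudaErrorDevicesUnavailable: every GPU is already owned
        // by another process, so fall back to the CPU path.
        cudaGetLastError();  // clear the error state for this thread
        return true;
    }
    return false;  // we now hold a context on a free GPU
}

// Usage inside the MEX function:
//   if (gpuIsBusy()) computeOnCpu(); else computeOnGpu();

Note that this sidesteps the race condition mentioned above: if two processes probe a free GPU at the same time, the driver lets only one of them actually create the context, and the other simply gets the error and takes the CPU path.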

There has to be some way of load-balancing between CPUs and GPUs. I have not seen any answer to this problem.

Thanks for that note, Doc. You are correct that there would be a race condition. But that should still allow for much improved utilization in my case.

I wish I could figure out how to turn on “compute exclusive” mode in Windows XP 64.

Yeah, Windows/CUDA are a weird pair. In one way, Windows is a first-class CUDA citizen (and Linux second) because Windows has Nsight. In another way, Linux is first and Windows is a distant third, because on Linux you can use nvidia-smi to set compute-exclusive mode on any GPU, while on Windows you need a Tesla card running the TCC driver. (Not to mention the same requirement for new CUDA 4.0 features like UVA.) I guess it all stems from Microsoft’s horrible WDDM driver model.

The answer is simple for my code: it is not possible. Any attempt at load balancing would require PCIe communication at every step, which immediately slows my performance by a factor of 2.

I have seen some cases with extremely heavy compute usage take advantage of CPU/GPU load balancing, though. John Stone has implemented some of these in VMD. In one, particles are binned, and the bin size is set such that about 90% of the bins are full on the GPU; the overfull bins are processed on the CPU. This problem also scales across GPUs and CPUs of very different performance in parallel using his WorkForce worker-thread model: the computation is broken up into tiles, and tiles are dynamically handed off to worker threads on demand (i.e. producer/consumer; see the sketch below). A worker thread on a GTX 580 will process tiles at one rate, a slower GPU (e.g. the display GPU) at another, and the CPU at a third. If the tiles are small enough, the load balances out evenly. However, I really only see this applying where you have a really huge number-crunching problem that involves minimal data transfer.
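For anyone curious, the on-demand hand-off can be as simple as workers pulling tile indices from a shared atomic counter. The sketch below is my own illustration of that producer/consumer pattern, with made-up tile counts and placeholder processTile functions; it is not VMD's actual WorkForce code.

#include <atomic>
#include <thread>
#include <vector>

static const int kNumTiles = 1024;    // assumed problem decomposition
static std::atomic<int> nextTile(0);  // shared "queue" of remaining tiles

static void processTileOnCpu(int /*tile*/) { /* ... CPU implementation ... */ }
static void processTileOnGpu(int /*tile*/) { /* ... launch kernel + sync ... */ }

// Each worker drives one device and pulls tiles until none remain, so a fast
// GTX 580 naturally claims more tiles than a slow display GPU or a CPU core.
static void worker(void (*processTile)(int))
{
    for (;;) {
        int tile = nextTile.fetch_add(1);  // atomically claim the next tile
        if (tile >= kNumTiles)
            break;  // no work left
        processTile(tile);
    }
}

int main()
{
    std::vector<std::thread> pool;
    // In practice each GPU worker would first bind to its own device
    // (e.g. via cudaSetDevice); omitted here for brevity.
    pool.emplace_back(worker, processTileOnGpu);  // fast GPU
    pool.emplace_back(worker, processTileOnGpu);  // slower/display GPU
    pool.emplace_back(worker, processTileOnCpu);  // CPU worker
    for (std::thread& t : pool) t.join();
    return 0;
}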