Folks, I have a Tesla box with four GPUs attached to an 8-CPU machine, and I currently run some MPI code that uses all 8 CPUs and all 4 GPUs just fine. I manage the scheduling manually using MPI ranks, a mod call, and cudaSetDevice, so that rank 0 gets GPU 0, rank 6 gets GPU 2 (= mod(6,4)), etc. Obviously over-scheduling the GPUs like this isn't ideal, since the two processes sharing a GPU run serially (and timings bear that out), but it at least runs.
Recently, though, I decided to see how my code reacted to using exclusive mode. So, I ran the necessary nvidia-smi commands to get:
> nvidia-smi -s
COMPUTE mode rules for GPU 0: 1
COMPUTE mode rules for GPU 1: 1
COMPUTE mode rules for GPU 2: 1
COMPUTE mode rules for GPU 3: 1
and found that 1-, 2-, and 4-process runs of my code work perfectly. However, when I try an 8-process job, the code crashes, which surprised me. I expected it to behave as in the manually managed case: over-scheduled GPUs would wait their turn. (Since, well, exclusive mode is just handling this for me.)
Is this the expected behavior with exclusive mode? That is, if more than one process tries to use a GPU, does it fail hard rather than wait for the resource to free up? (I suppose process 4 looks for a free GPU, doesn't find one, and kaboom!) Or is there some sort of synchronization or wait command I can issue before my CUDA calls to make sure that process 6 waits for process 2 to finish with GPU 2 before doing its copies and kernel launches?
ETA: Oh, I’m using CUDA 3.0 on this system, in case that matters.
In exclusive mode, the driver will simply refuse to allow a context to be established, and the API call that tried to establish the context will fail with an error code. You should check for that in your code; otherwise the application will fail in an ungraceful fashion. There is no notion of queuing or anything like that at the driver level.
My solution has been to use Sun Grid Engine to manage the GPUs as a consumable resource. When I want to run something, I specify how many GPUs I want, and the SGE scheduler will only dispatch the job to the hardware when the requested number of GPUs is free.
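For anyone wanting to set up something similar, the SGE side is roughly the following (sketched from memory as a config template, not an exact dump; the complex name `gpu`, the capacity of 4, and the job script name are all site-specific choices):

```shell
# 1. Define a consumable complex via `qconf -mc` (one row in the complex table):
#      name  shortcut  type  relop  requestable  consumable  default  urgency
#      gpu   gpu       INT   <=     YES          YES         0        0
# 2. Give each GPU node a capacity via `qconf -me <nodename>`:
#      complex_values   gpu=4
# 3. Request GPUs at submission time; SGE holds the job until enough are free:
qsub -l gpu=2 my_gpu_job.sh
```

Note that SGE only counts the resource; the job script is still responsible for picking which physical GPUs to use.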
Ah, I’d feared as much. Would be nice if it wasn’t so, but it’s not wholly unexpected.
Yeah, eventually I imagine the resources will be managed by PBS or the like as more people use them. Looks like for my coding, at least, I’ll go back to managing the resources myself for now since I have (nigh-)exclusive use of the machine.
But the day will come for me when # CPUs must equal # GPUs, so it's time to start thinking about the best ways to use those idle CPUs!
On our cluster, we limit CPU jobs on the GPU nodes to 24 hours or less. Only jobs that use the GPU resources can run longer, managed via a QOS. That keeps the GPUs free and also guarantees that there are always CPUs free for short jobs, a win-win situation in a cluster that runs lots of each type of job.