Folks, I have a Tesla box with four GPUs attached to an 8-CPU machine, and I currently run some MPI code that uses all 8 CPUs and all 4 GPUs just fine. I manage the scheduling manually using MPI ranks, a mod call, and cudaSetDevice, so that rank 0 gets GPU 0, rank 6 gets GPU 2 (= mod(6,4)), etc. Obviously over-scheduling the GPUs like this isn't ideal, since the two processes sharing a GPU run serially (and timings bear that out), but it at least runs.
Recently, though, I decided to see how my code reacted to using exclusive mode. So, I ran the necessary nvidia-smi commands to get:
> nvidia-smi -s
COMPUTE mode rules for GPU 0: 1
COMPUTE mode rules for GPU 1: 1
COMPUTE mode rules for GPU 2: 1
COMPUTE mode rules for GPU 3: 1
and found that 1-, 2-, and 4-process runs of my code work perfectly. However, when I try an 8-process job, the code crashes, which surprised me. I expected it to behave as in the manually managed case: over-scheduled GPUs would wait their turn. (Since, well, exclusive mode is just handling this for me.)
Is this the expected behavior with exclusive mode? That is, if more than one process tries to use a GPU, does it fail hard rather than wait for the resource to free up? (I suppose process 4 looks for a free GPU, doesn't find one, and kaboom!) Or is there some sort of synchronization or wait command I can issue before my CUDA calls to make sure that process 6 waits for process 2 to finish with GPU 2 before doing its copies and kernel launches?
ETA: Oh, I’m using CUDA 3.0 on this system, in case that matters.
In exclusive mode, the driver will simply refuse to allow a context to be established, and the API call that tried to establish the context will fail with an error code. You should check for that in your code; otherwise the application will fail in an ungraceful fashion. There is no notion of queuing or anything like that at the driver level.
My solution has been to use Sun Grid Engine to manage the GPUs as a consumable resource. When I want to run something, I specify how many GPUs I want, and the SGE scheduler will only dispatch the job to the hardware when the requested number of GPUs is free.
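For anyone wanting to set up something similar, the SGE side is roughly the following (sketched from memory as a config template, not an exact dump; the complex name `gpu`, the capacity of 4, and the job script name are all site-specific choices):

```shell
# 1. Define a consumable complex via `qconf -mc` (one row in the complex table):
#      name  shortcut  type  relop  requestable  consumable  default  urgency
#      gpu   gpu       INT   <=     YES          YES         0        0
# 2. Give each GPU node a capacity via `qconf -me <nodename>`:
#      complex_values   gpu=4
# 3. Request GPUs at submission time; SGE holds the job until enough are free:
qsub -l gpu=2 my_gpu_job.sh
```

Note that SGE only counts the resource; the job script is still responsible for picking which physical GPUs to use.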
Ah, I’d feared as much. Would be nice if it wasn’t so, but it’s not wholly unexpected.
Yeah, eventually I imagine the resources will be managed by PBS or the like as more people use them. Looks like for my coding, at least, I’ll go back to managing the resources myself for now since I have (nigh-)exclusive use of the machine.
But the day will come for me when # CPUs must equal # GPUs, so it's time to start thinking about the best ways to use those idle CPUs!
On our cluster, we limit CPU jobs on the GPU nodes to 24 hours or less. Only jobs that use the GPU resources can run longer, managed via a QOS. That keeps the GPUs free and also guarantees that there are always CPUs free for short jobs, a win-win situation in a cluster that runs lots of each type of job.