MIG load balancing

Hi, I am fairly new to CUDA and was wondering if I could check my understanding of using MIG-partitioned GPUs in an interactive multi-user system.

We recently deployed two new servers, each with dual A100 GPUs, to replace existing machines with 4x K80 GPUs. Benchmarking has suggested that when an A100 is fully partitioned into 7 MIG instances, each instance is comparable to a K80, so we’re hoping for a significant increase in throughput.

The K80s were used interactively on a multi-user server, and we put them in ‘exclusive process’ mode. This meant that if one user started a CUDA process requesting a single GPU, it took over that GPU, and if a different process subsequently requested a GPU it would get a different one (assuming no more than 4 processes were using CUDA at once). So we never had to worry about users having to allocate GPUs to their processes manually.
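For concreteness, my understanding is that in exclusive process mode (set via nvidia-smi -c EXCLUSIVE_PROCESS) context creation simply fails on a busy device, which is what allows a process to fall through to a free one. Here is a rough sketch of that probing pattern; it assumes CuPy just for illustration, and grab_free_gpu is a made-up helper name:

```python
# Sketch: under 'exclusive process' compute mode, creating a context on a
# busy GPU fails, so a process can probe devices in order until one succeeds.
# Uses CuPy purely for illustration.
import cupy

def grab_free_gpu(num_gpus=4):
    for dev in range(num_gpus):
        try:
            cupy.cuda.Device(dev).use()
            cupy.arange(1)  # forces context creation on the selected device
            return dev
        except cupy.cuda.runtime.CUDARuntimeError:
            continue  # device already owned by another process
    raise RuntimeError("all GPUs are busy")
```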

With MIG, my understanding is that you cannot put the GPUs into exclusive process mode, and the default behaviour is that any process requesting a GPU is simply allocated the first MIG partition of the first GPU. Obviously this isn’t going to work for multiple users! The only solution seems to be for users to manually ‘check’ which MIG instances are in use and then set CUDA_VISIBLE_DEVICES to a suitable free instance. This seems a bit awkward, and we may also run into problems with some third-party software that runs multiple CUDA processes in parallel - without modification, I don’t think it will be able to distribute its processing over multiple MIG instances.
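To make the manual approach concrete, my understanding is that each MIG instance gets a ‘MIG-…’ UUID (listed by nvidia-smi -L) which can be passed through CUDA_VISIBLE_DEVICES, something like the sketch below (the UUID and train.py are placeholders):

```python
# Sketch: pinning a child process to one MIG instance by UUID.
import os
import subprocess

env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-..."  # placeholder: real UUIDs come from `nvidia-smi -L`
subprocess.run(["python", "train.py"], env=env)  # train.py: any CUDA program (hypothetical)
```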

Is there any solution that I’m missing here? Right now the only thing I can think of is a script that sets CUDA_VISIBLE_DEVICES by inspecting the currently running processes to identify ‘free’ MIG instances. Alternatively we could use a queueing system like Slurm, but ideally we want to allow interactive use.
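Roughly what I had in mind for the script is sketched below. Two big assumptions flagged: that nvidia-smi -L lists the MIG UUIDs, and that nvidia-smi --query-compute-apps=gpu_uuid reports which instances have running work (on some driver versions processes may be reported against the parent GPU instead). There is also an obvious race if two users run it at the same moment.

```python
# Sketch: pick a MIG instance with no compute processes currently running.
import re
import subprocess

def list_mig_uuids():
    """All MIG instance UUIDs visible on this machine (from `nvidia-smi -L`)."""
    out = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    return re.findall(r"UUID:\s*(MIG-[A-Za-z0-9/-]+)", out)

def busy_uuids():
    """UUIDs that currently have compute processes running on them
    (assumption: the query reports MIG UUIDs; may vary by driver version)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid", "--format=csv,noheader"],
        text=True)
    return {line.strip() for line in out.splitlines() if line.strip()}

def pick_free_instance():
    busy = busy_uuids()
    for uuid in list_mig_uuids():
        if uuid not in busy:
            return uuid
    raise RuntimeError("all MIG instances appear to be busy")

if __name__ == "__main__":
    # e.g. export CUDA_VISIBLE_DEVICES=$(python pick_mig.py)
    print(pick_free_instance())
```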

Thanks for any help anyone can offer - as I said, I’m no CUDA expert, so it’s possible I’ve just misunderstood things or missed something obvious.

Bw,
Martin

Using Slurm is a good choice.

Slurm can provide an “interactive” experience, e.g. an srun --pty bash session inside a GPU allocation.

Only one MIG instance can be exposed to a process at a time. Therefore, to distribute processing over multiple MIG instances, you would use a process-based distribution system like MPI and assign a separate MIG instance to each MPI process/rank via CUDA_VISIBLE_DEVICES, as in the sketch below.
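A minimal sketch of that pattern, assuming mpi4py and CuPy (the UUID list is a placeholder; in practice it would come from nvidia-smi -L or a resource manager):

```python
# Sketch: one MPI rank per MIG instance. CUDA_VISIBLE_DEVICES must be set
# before any CUDA initialisation in the rank.
import os
from mpi4py import MPI

MIG_UUIDS = ["MIG-...", "MIG-...", "MIG-..."]  # placeholders for `nvidia-smi -L` values

rank = MPI.COMM_WORLD.Get_rank()
os.environ["CUDA_VISIBLE_DEVICES"] = MIG_UUIDS[rank % len(MIG_UUIDS)]

import cupy  # imported only after setting the environment variable

x = cupy.arange(10)  # this allocation lands on the rank's own MIG instance
print(f"rank {rank}: sum = {x.sum()}")
```

Launched with, say, mpirun -np 3 python mig_ranks.py, each rank sees exactly one MIG instance as its device 0.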

Great, thanks for confirming that I’m not missing something obvious. We’re going to use a script to identify ‘free’ MIG instances initially and will consider Slurm if that isn’t sufficient.

Do you think there’s any likelihood that NVIDIA will improve this in the future, e.g. by enabling exclusive process mode on MIG partitions? At the moment it seems MIG is only usable either by introducing a queueing system or by asking users to manage GPU instance allocation manually.

I haven’t heard of such changes, but at the same time, I might not know about them, and my role here is not really to discuss the future.