I apologize if this is the wrong subforum; it seemed the most likely fit at least…
Our HPC cluster (running Slurm) was recently upgraded with a number of A100 cards, which we are now trying to get the most out of. That includes figuring out how to activate the Multi-Instance GPU (MIG) functionality. But reading through the NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation, it seems to assume that users have sudo rights?
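For reference, this is roughly what the admin-side setup in the user guide looks like; every step needs root, which is why it seems out of reach for ordinary users. (This is a sketch based on our reading of the guide; profile ID 19 should correspond to 1g.5gb on an A100-40GB, but the IDs can be listed with `nvidia-smi mig -lgip`.)

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot to take effect).
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.5gb GPU instances (profile ID 19 on A100-40GB) and,
# with -C, a default compute instance inside each of them.
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# List the resulting MIG devices (this part does not need root).
nvidia-smi -L
```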
If the admin has enabled MIG on each GPU, is it then possible for users, in their jobscripts, to 'activate' seven MIG 1g.5gb profiles and then assign a CUDA job to each instance?
Right now, the closest we can get is first running a job with 'nvidia-smi -L' on the node to get the device IDs (they look like 'MIG-GPU-09156ffa-eece-6481-ce94-42ac07f27aa4/7/0'), and then running the 'real' jobscript with lines like
CUDA_VISIBLE_DEVICES=MIG-GPU-09156ffa-eece-6481-ce94-42ac07f27aa4/7/0 "CUDA job" &
But this seems like a very cumbersome workflow — is there a simpler or intended way to do this from a user jobscript?
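In case it helps frame the question: the manual two-step workflow above can at least be collapsed into a single jobscript by parsing the 'nvidia-smi -L' output on the node. A sketch of what we mean (assuming MIG instances already exist on the node; 'my_cuda_app' is a placeholder for our real binary):

```shell
#!/bin/bash
# Run inside the Slurm job, on the allocated node.
# Collect the MIG device identifiers (the MIG-GPU-.../<gi>/<ci> strings)
# from the UUID field of `nvidia-smi -L`; plain GPU lines use "GPU-..."
# and are not matched.
mapfile -t MIG_IDS < <(nvidia-smi -L | grep -o 'MIG-[^)]*')

# Launch one CUDA process per MIG instance, pinned via CUDA_VISIBLE_DEVICES.
for id in "${MIG_IDS[@]}"; do
    CUDA_VISIBLE_DEVICES="$id" ./my_cuda_app &
done

wait   # block until all per-instance jobs have finished
```

This still relies on the instances having been created beforehand, so it does not answer whether users can create the profiles themselves — it only removes the separate "discovery" job.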