I have multiple GPUs on a node in my cluster and am trying to run some benchmarks on the system. However, since my department is in research and has a job system set up, I can only take one GPU offline to test until I am sure I know what I am doing. My problem is setting up the mpirun command to specify a single GPU. I understand how to use mpirun to run programs on specific processors on different nodes, but the GPUs show up as devices and not processors. How do I set up my mpirun command to use a specific GPU, or is that even possible?
mpirun doesn’t handle allocating GPUs. You can define the GPU as a resource in your job system, though, and schedule based on that. The GPU itself is chosen from within the application, and MPI doesn’t have any control over that. What is usually done is to set the GPUs in exclusive mode and not choose the GPU explicitly from within the application. That way each process gets the first free GPU. If you want to pick a specific GPU, you can allocate one instance per node and choose the GPU explicitly from there (again, setting the GPUs in exclusive mode and getting the first available one is the preferred way).
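To illustrate what "choosing the GPU from within the application" can look like, here is a minimal Python sketch. It assumes Open MPI as the launcher, which exports OMPI_COMM_WORLD_LOCAL_RANK to each process; other MPI implementations use different variable names. The function names and the GPU list [4, 5] are hypothetical, picked only for this example (zero-based indices, so 4 and 5 would be the fifth and sixth GPUs).

```python
import os


def pick_gpu_for_rank(local_rank, allowed_gpus):
    """Return the system-wide GPU index a node-local rank should use.

    Ranks are spread round-robin over the allowed GPUs (hypothetical policy).
    """
    if not allowed_gpus:
        raise ValueError("no GPUs available")
    return allowed_gpus[local_rank % len(allowed_gpus)]


def local_rank_from_env(environ=os.environ):
    """Read the node-local rank from the environment.

    Assumption: the launcher is Open MPI, which sets
    OMPI_COMM_WORLD_LOCAL_RANK. Falls back to 0 when it is not set.
    """
    return int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))


# Example: only GPUs 4 and 5 are available to this job (assumed values).
gpu = pick_gpu_for_rank(local_rank_from_env(), [4, 5])
print("this rank would use system GPU", gpu)
```

The application would then pass the chosen index to whatever device-selection call its GPU library provides. With the GPUs in exclusive mode this step is unnecessary, since each process simply ends up on the first free device.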
I’m trying to run the HPL benchmark on a system with CPUs and GPUs, so is that specified in the HPL installation?
If you are using the NVIDIA code, it does an internal assignment so that each MPI rank is using a different GPU.
You just need to use a number of MPI processes equal to the number of GPUs you want to use.
So is it possible to choose a specific GPU that gets used? I have 6 right now, but the first 4 do scheduling for CPUs, so I would like to specify the 5th and 6th only until I figure out how to use HPL with GPUs and CPUs.
You could use the CUDA_VISIBLE_DEVICES variable:
Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES environment
variable. Visible devices should be included as a comma-separated list in
terms of the system-wide list of devices. For example, to use only devices 0
and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES equal to
“0,2” before launching the application. The application will then enumerate
these devices as device 0 and device 1.
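As a rough sketch of how this could be applied to the case above, the Python snippet below shows the renumbering the quoted text describes and builds an mpirun command line restricted to two GPUs. Several things here are assumptions: the launcher is Open MPI (whose -x option exports an environment variable to the launched processes), the benchmark binary is called ./xhpl, device indices are zero-based so the fifth and sixth GPUs are 4 and 5, and the helper names are made up for illustration. The command is only printed, not executed.

```python
import os


def visible_device_map(value):
    """Map application-visible device index -> system-wide device index.

    Simplification: only handles a comma-separated list of integer indices.
    """
    system_ids = [int(tok) for tok in value.split(",") if tok.strip()]
    return dict(enumerate(system_ids))


def build_launch(gpu_ids, program, base_env=None):
    """Build an mpirun command and environment using one rank per GPU.

    Assumption: Open MPI, where "-x NAME" exports NAME to every rank.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    cmd = ["mpirun", "-np", str(len(gpu_ids)),
           "-x", "CUDA_VISIBLE_DEVICES", program]
    return cmd, env


# The example from the quoted text: system devices 0 and 2 become 0 and 1.
print(visible_device_map("0,2"))  # {0: 0, 1: 2}

# Hypothetical launch on the fifth and sixth GPUs with an assumed binary name.
cmd, env = build_launch([4, 5], "./xhpl", base_env={})
print(" ".join(cmd), "with CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

To actually launch, the pair could be handed to subprocess.run(cmd, env=env). Inside the application the two GPUs would then show up as device 0 and device 1, which matches the advice above to start as many MPI processes as GPUs you want to use.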