Unable to run on more than 1 GPU

I am trying to run NVIDIA Modulus with the FPGA laminar flow example and a simulation I created myself. Both run and produce results, but I am unable to get either to use more than one GPU at a time.

Launching with SLURM srun.
Using the NVIDIA Modulus container, converted to Singularity.
Tested on nodes with T4 GPUs and, separately, on nodes with V100 GPUs.
I can start a Python session, import torch, and the reported device count matches however many GPUs I requested with srun (inside and outside the container).
nvidia-smi also shows the correct GPU count and types (inside and outside the container).
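
For reference, the check inside the container was roughly something like this (a minimal sketch; the exact session is not the point):

import torch

# Both calls report all of the GPUs granted by the srun allocation,
# even though the training itself only ever uses one of them.
print(torch.cuda.is_available())
print(torch.cuda.device_count())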

I had to modify the srun command provided in the Modulus documentation (Performance section) to set the correct number of GPUs:

srun --gres=gpu:t4:4 --cpus-per-gpu=8 singularity exec --nv -B /data:/data ./data/modulus.20.09.sif python ./data/MyProgramOrFPGAexample.py

I’ve attempted to specify the --mpi argument as none, pmi2, and pmix; none of them changes the GPU usage.

I’ve also attempted to use mpirun (-np 4). Adding it before the singularity command just launches that many containers, and adding it as an argument to run inside the container returns an error that the specified number of resources is not available, which seems odd considering nvidia-smi and torch show that the GPUs are available.

Any help is appreciated!

@patterson Looks like you don’t have the -n option in your srun command. Without it, srun defaults to running a single task: Slurm Workload Manager - srun.

That’s probably also why you’re not able to run with mpirun, since the hostfile likely only has one slot based on the SLURM allocation.

If you instead run with srun -n 4 ..., that will launch 4 tasks and each task will target one GPU. Ensure that the following environment variables are set appropriately inside the container so that Modulus and PyTorch can pick those up to set up the distributed job: SLURM_PROCID, SLURM_NPROCS, SLURM_LOCALID and SLURM_LAUNCH_NODE_IPADDR.
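
As a rough illustration only (not the exact code Modulus uses internally), those variables map onto a standard torch.distributed setup along these lines; the NCCL backend and port 29500 are just assumptions for the sketch:

import os
import torch
import torch.distributed as dist

# Sketch of how the SLURM_ variables feed a distributed setup;
# Modulus handles this internally, so this is only illustrative.
rank = int(os.environ["SLURM_PROCID"])            # global rank of this task
world_size = int(os.environ["SLURM_NPROCS"])      # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])     # task index on this node
master_addr = os.environ["SLURM_LAUNCH_NODE_IPADDR"]

torch.cuda.set_device(local_rank)                 # one GPU per task
dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{master_addr}:29500",     # port chosen arbitrarily here
    rank=rank,
    world_size=world_size,
)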

Thanks for the suggestions; they pointed us in the right direction. Our single-node success with mpirun threw us off a bit, but we traced the problem back to the SLURM_ variables not showing up correctly inside the containers.

Perhaps it is just our Slurm configuration, but to get the SLURM_ environment variables to show up correctly, we need to allocate with an sbatch command and batch file:

sbatch slurmGPU.sbatch

where the slurmGPU.sbatch file contents are:

#!/bin/bash
#slurmGPU.sbatch

#SBATCH --job-name=Modulus 
#SBATCH --gpus=8  
#SBATCH --cpus-per-gpu=2
#SBATCH --output=sbatchOutput.txt

srun -n 8 singularity exec --nv -B /data:/data ./data/modulus.20.09.sif python ./data/mySimulationFile.py 

which then uses srun -n 8 inside the allocation to launch one task per GPU.
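
If anyone wants to sanity-check their own setup first, a throwaway script substituted for mySimulationFile.py (the contents below are just a sketch, not anything from Modulus) will show whether each of the 8 tasks sees the right variables and GPUs:

import os
import socket
import torch

# Each srun task prints what it sees, so the SLURM_ variables and
# GPU visibility can be confirmed inside the container.
print(
    f"host={socket.gethostname()} "
    f"rank={os.environ.get('SLURM_PROCID')}/{os.environ.get('SLURM_NPROCS')} "
    f"local_id={os.environ.get('SLURM_LOCALID')} "
    f"visible_gpus={torch.cuda.device_count()}"
)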

Hopefully this can help others. Thanks again.
