Large virtual memory reserved and scheduling issues

Hi all,

I'm not sure this is the right place to ask, but perhaps someone has run into this before. I have a CUDA Fortran based code that needs about 100 MB per MPI rank on the GPU (I am checking this with nvidia-smi at runtime). However, when I look at the virtual memory via top or similar, I notice that 20+ GB are reserved per rank. My understanding is that this is due to unified memory, although my understanding of that is spotty.

I can run the code without issues when I log into a compute node with a single K80 GPU and launch it directly with mpirun, with or without MPS, for up to 26 MPI ranks (with MPS) or 40 (without MPS).

The issues start when I try to schedule the job from the head node with Slurm. Because of the large reserved virtual memory, Slurm kills the jobs with out-of-memory errors.

Is there a way, when compiling the code with mpifort/pgfortran, to avoid having so much virtual memory reserved? I can't seem to find a reasonable solution on the Slurm side.

Thanks, Jan

Hi Jan,

However, when I look at the virtual memory via top or similar, I notice that 20+ GB are reserved per rank. My understanding is that this is due to unified memory, although my understanding of that is spotty.

I believe you're correct: the CUDA driver reserves virtual address space equal to the size of the CPU memory plus the total memory of all GPUs.
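For a rough sense of scale (these figures are only an illustration, not your node's actual sizes): with 8 GB of system RAM and a single 12 GB K80 device, each process would already map about 8 + 12 = 20 GB of virtual address space, regardless of how little memory the code actually touches.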

Unfortunately, I don’t see any CUDA documentation that shows how to control this behavior so I’m not sure what can be done about it. Let me do some research and see what I can find.

-Mat

Hi Jan,

I queried a few folks at NVIDIA for suggestions. Unfortunately, with K80s the virtual memory usage is inherent to how CUDA unified memory works, so it can't be changed.

Are you able to work with your site admins to see if you can increase Slurm's memory limits (/etc/security/limits.conf)?

Note that later NVIDIA GPUs, such as Pascal, do not need to reserve this virtual memory and so do not have this issue.

-Mat

Here’s one of the responses that I got back:

It sounds like Slurm is misconfigured, either on the user side or in the cluster configuration.

Using --mem or --mem-per-cpu at job launch (srun/salloc/sbatch) may alleviate the issue.

  --mem=MB            minimum amount of real memory [per node]
  --mem-per-cpu=MB    maximum amount of real memory per allocated CPU
                      required by the job.
                      --mem >= --mem-per-cpu if --mem is specified.
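For example (illustrative values and program name, not a command from this thread):

  srun -n 4 --mem=4000 ./my_mpi_app            # 4000 MB of real memory for the node
  srun -n 4 --mem-per-cpu=1000 ./my_mpi_app    # 1000 MB per allocated CPU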

It's also possible the limits are not set properly on the node in /etc/security/limits.conf [Ubuntu]. Slurm suggests that memlock and stack be set to unlimited (example file entries after the list):

  • soft memlock unlimited
  • hard memlock unlimited
  • soft stack unlimited
  • hard stack unlimited
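For reference, the corresponding lines in /etc/security/limits.conf would look roughly like this (the * domain applies them to all users; adjust that field to your site's policy):

  *   soft   memlock   unlimited
  *   hard   memlock   unlimited
  *   soft   stack     unlimited
  *   hard   stack     unlimited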

Hi Mat,

Thanks for looking into this. My Slurm script looks like this:

#!/bin/bash
#SBATCH --time=100:00:00
#SBATCH -N 1                  # one node
#SBATCH -n 4                  # four MPI tasks
#SBATCH --mem=32000           # real memory per node, in MB
#SBATCH --mem-per-cpu=1000    # real memory per allocated CPU, in MB
#SBATCH --gres=gpu:2          # two GPU devices

module load cuda
srun --mpi=pmi2 prjmh_temper_cuda_buck > ./out.log

I think that is an allowable configuration. I was not aware of limits.conf, but I have since set it to the recommended values from your post (on all nodes in the cluster).

Finally, I restarted slurmd with these environment variables set (as suggested in the Slurm FAQ):
export SLURMD_OOM_ADJ=-17
export SLURMSTEPD_OOM_ADJ=-17

When I submit a job, the issue remains the same and the job gets killed with the following messages:

slurmstepd: Step 5713.0 exceeded virtual memory limit (83806492 > 29491200), being killed
slurmstepd: *** STEP 5713.0 CANCELLED AT 2018-02-07T11:42:54 *** on compute-0-3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: Exceeded job memory limit
slurmstepd: Step 5713.0 exceeded virtual memory limit (83806492 > 29491200), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: Step 5713.0 exceeded virtual memory limit (83806492 > 29491200), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: Exceeded job memory limit
srun: got SIGCONT
slurmstepd: *** JOB 5713 CANCELLED AT 2018-02-07T11:42:54 *** on compute-0-3
srun: forcing job termination
srun: error: compute-0-3: task 0: Killed
srun: error: compute-0-3: tasks 1-3: Killed

I must have some other issue. I can't find other people reporting this, so I suppose I have to go back over the full Slurm configuration.
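Since the error explicitly mentions a virtual memory limit, one thing I plan to check (just a guess on my part) is whether slurm.conf enforces a virtual memory limit on top of the requested real memory via VSizeFactor. Something along these lines, where the value is only an example:

  # slurm.conf (excerpt) -- illustrative, not our actual configuration
  # VSizeFactor sets a job's virtual memory limit as a percentage of its
  # allocated real memory; 0 disables enforcement of virtual memory limits.
  VSizeFactor=0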

Thanks, Jan