[HPC-Benchmarks 21.4] libnuma error

Hello everyone,

This is my first time posting here. I would appreciate it if you could provide some insight into the following issue.
I’m encountering a peculiar error related to numactl with the NVIDIA HPC-Benchmarks 21.4.

[System Information]
OS: CentOS Linux release 7.9.2009
CPU: AMD EPYC 7742 (64-core)
GPU: 8 x A100-SXM4 (HGX)
Driver: 495.29.05

[CPU-GPU Connection Topology]

GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     48-63   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     48-63   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     16-31   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     16-31   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PXB     PXB     112-127 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PXB     PXB     112-127 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     80-95   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     80-95   5
mlx5_0  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      PIX
mlx5_1  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     PIX      X

[Steps to reproduce]

mpirun \
    -np 8 \
    --mca btl ^openib \
    singularity \
       run \
          -B /home/moasys/BMT/test/02-HPL_NVIDIA/nodes-01/:/mnt \
          --nv ../../../images/hpc-benchmarks\:20.10-hpl.sif \
          hpl.sh \
             --dat /mnt/HPL.dat \
             --cpu-affinity 3:3:1:1:7:7:5:5 \
             --gpu-affinity 0:1:2:3:4:5:6:7 \
             --cpu-cores-per-rank 8 > HPL.out

[Error message]

<1> is invalid
libnuma: Warning: cpu argument 1 is out of range

As a result, I can only run the benchmark with a single A100 GPU.
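
For reference, the core and node numbering that libnuma accepts can be checked on the host with numactl itself (this assumes numactl is available outside the container; the ranges in the comments follow the topology output above):

# Print the NUMA layout: node IDs and the CPU cores belonging to each node
numactl --hardware

# --physcpubind expects CPU core IDs (0-127 on this system)
numactl --physcpubind=48 true && echo "core binding ok"

# --cpunodebind / --membind expect NUMA node IDs (0-7 on this system)
numactl --cpunodebind=3 --membind=3 true && echo "node binding ok"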

The CPU-GPU affinity mapping is somewhat awkward, since it would arguably be better to assign one GPU per NUMA domain.
However, this problem was not observed with the 20.10 release, where monitoring with htop confirmed that the MPI processes were created in their designated NUMA domains.

Could you give some suggestions on how to solve this problem?

Regards.


Did you ever figure it out? I'm having the same issue.

What I did was replace the numactl command from 21.4 with the one from 20.10.

[Before]

info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} net=${UCX_NET_DEVICES} bin=$XHPL"

numactl --physcpubind=${CPU} ${MEMBIND} ${XHPL} ${DAT}

[After]

if [ -z "${MEM}" ]; then
  info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"

  numactl --cpunodebind=${CPU} ${XHPL} ${DAT}
else
  info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"

  numactl --physcpubind=${CPU} --membind=${MEM} ${XHPL} ${DAT}

To apply this, dump the contents of the 21.4 image to a folder, modify the hpl.sh script, and finally rebuild the image.
I suspect there is an issue with memory binding, but I did not look too deeply into it.
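
A rough sketch of that extract / edit / rebuild cycle (the file names and the location of hpl.sh inside the image are assumptions and may differ on your system):

# Unpack the 21.4 SIF image into a writable sandbox directory
singularity build --sandbox hpc-benchmarks-21.4/ ./hpc-benchmarks:21.4-hpl.sif

# Edit the wrapper script inside the sandbox (its path inside the image may differ)
vi hpc-benchmarks-21.4/workspace/hpl.sh

# Rebuild a SIF image from the modified sandbox (may require root or --fakeroot)
singularity build hpc-benchmarks-21.4-patched.sif hpc-benchmarks-21.4/
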
I hope it helps.


Thanks @vitduck, it helped a lot! I was able to run the benchmark properly after extracting the files from the container and adapting the hpl.sh file with your settings.

I am glad that it has helped your case.
Regards.

I ran into the same error, in more ways than one. On an Inspur system with Cascade Lake CPUs and a more regular NUMA structure, the scripts work when ranges of CPUs are provided, but I need to prefix every CPU range with + for relative numbering, otherwise the script complains. On an HGX system like this one, the same approach complains about some of the provided CPU ranges, but not all of them. Moreover, if I change the number of MPI processes, the problematic CPU ranges change as well. Using relative (+) or direct ranges did not save the day, and adding or removing a + sign for some of the ranges only moved the error message to another range. Very nonsensical. Switching to --cpunodebind= from --physcpubind= in the second case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, as it binds too many processes to the same socket instead of to distinct core groups. If one does not use membind, a solution with actual core ranges can be devised with taskset.

if [ -z "${MEM}" ]; then
  info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"

  taskset -c ${CPU} ${XHPL} ${DAT}
else
  info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"

  numactl --cpunodebind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi

Still, I'd like a solution that allows me to specify CPU ranges and membind instead of relying on the first-touch policy.

Notice that for the taskset case one would use something like

--cpu-affinity 16-23:24-31:48-55:56-63:80-87:88-95:112-119:120-127 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5

while for the numactl case one would use

--cpu-affinity 1:1:3:3:5:5:7:7 --mem-affinity 1:1:3:3:5:5:7:7 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5
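
A possible way to get both an explicit core range and --membind, untested on my side, would be to let numactl set only the memory policy and leave the core pinning to taskset (variable names as in hpl.sh):

# Untested sketch: numactl applies the memory policy, which is inherited across exec,
# and taskset then pins the rank to its core range before launching xhpl.
# ${MEM} is a NUMA node index (e.g. 1), ${CPU} a core range (e.g. 16-23).
numactl --membind=${MEM} taskset -c ${CPU} ${XHPL} ${DAT}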

@ce107

I had the same observation with Cascade Lake as well.
Could you elaborate on this assessment?

Switching to --cpunodebind= from --physcpubind= in the second case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, as it binds too many processes to the same socket instead of to distinct core groups.

The original 21.4 script's --membind caused the index error. It seems that we cannot mix a CPU-range based --physcpubind with a NUMA-index based --membind, at least on EPYC. I will adopt your approach since it allows better memory management.
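
For illustration, the difference between the two option families, using node 1 = cores 16-31 from the topology above (a sketch of the syntax only, not the exact command the wrapper builds):

# --physcpubind / -C takes CPU core IDs, e.g. a range of cores
numactl --physcpubind=16-23 ./xhpl HPL.dat

# --cpunodebind / -N and --membind / -m take NUMA node IDs
numactl --cpunodebind=1 --membind=1 ./xhpl HPL.dat
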
I checked the process placements during execution and they were indeed allocated in the correct NUMA nodes. As long as there is a 1:1 ratio between CPUs and GPUs, there should be no issue with over-subscription.
Am I understanding your concern correctly?

Would it be simpler to just replace the original command with the 'else' clause, i.e.:

numactl --cpunodebind=${CPU} ${MEMBIND} ${XHPL} ${DAT}
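
If ${MEMBIND} is only populated when --mem-affinity is given, that single line would indeed cover both cases. A guess at how the wrapper could set it up (not the actual 21.4 code):

# Hypothetical: leave MEMBIND empty unless a --mem-affinity value was parsed
MEMBIND=""
if [ -n "${MEM}" ]; then
  MEMBIND="--membind=${MEM}"
fi

numactl --cpunodebind=${CPU} ${MEMBIND} ${XHPL} ${DAT}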