This is my first time posting here. I would appreciate it if you could provide some insight into the following issue.
I’m encountering a peculiar error related to numactl with the NVIDIA HPC-Benchmarks 21.4.
[System Information]
OS: CentOS Linux release 7.9.2009
CPU: AMD EPYC 7742 (64-core)
GPU: 8 x A100-SMX4 (HGX)
Driver: 495.29.05
<1> is invalid
libnuma: Warning: cpu argument 1 is out of range
Therefore I can only run the benchmark with 1 A100 GPU.
The CPU-GPU affinity is somewhat awkward, since it may be better to assign one GPU per NUMA domain.
This problem was not observed with the 20.10 release, where monitoring with htop confirmed that the MPI processes had been created in their designated NUMA domains.
Could you give some suggestions on how to solve this problem?
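For what it's worth, the libnuma warning suggests numactl was handed a cpu (or node) index outside what libnuma sees on the system. A minimal sketch of sanity-checking a "lo-hi" cpu range against the visible cpu count before passing it to numactl (the helper name `range_ok` is invented, not part of the benchmark scripts):

```shell
#!/bin/sh
# Illustrative helper: succeed only if every cpu in a "lo-hi" range
# (or a single cpu number) exists on a node with $2 cpus total.
range_ok() {
    range=$1
    ncpus=$2              # e.g. $(nproc --all)
    hi=${range#*-}        # "112-119" -> "119"; a bare "5" stays "5"
    [ "${hi}" -lt "${ncpus}" ]
}

range_ok "112-119" 128 && echo "ok" || echo "out of range"
```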
if [ -z "${MEM}" ]; then
    info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"
    numactl --cpunodebind=${CPU} ${XHPL} ${DAT}
else
    info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"
    numactl --physcpubind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi
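For context, the per-rank ${CPU} and ${MEM} values above come from colon-separated affinity lists passed to the launcher (as in the `--cpu-affinity` examples later in this thread). A minimal sketch of selecting the LOCAL_RANK-th field with cut (the function name `pick_affinity` is invented, not from the container image):

```shell
#!/bin/sh
# Illustrative: pick the entry for local rank $2 (0-based) from a
# colon-separated affinity list such as "16-23:24-31:48-55:56-63".
pick_affinity() {
    echo "$1" | cut -d: -f$(($2 + 1))
}

CPU=$(pick_affinity "16-23:24-31:48-55:56-63" 2)
echo "cpu=${CPU}"   # cpu=48-55
```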
Please extract the contents of the 21.4 image to a folder, modify the hpl.sh script, and then rebuild the image.
I guess there might be an issue with memory binding, but I didn't look too deeply.
I hope it helps.
Thanks @vitduck, it helped A LOT! I was able to run the benchmark properly after extracting the files from the container and adapting the hpl.sh file with your settings.
I ran into the same error, in more ways than one. On an Inspur system with Cascade Lake CPUs and a more regular NUMA structure, the scripts work when ranges of cpus are provided, but I need to prefix every cpu range with + for relative numbering, otherwise the script complains. On an HGX system like this one, the same approach complains about some of the cpu ranges but not all of them! Moreover, if I changed the number of MPI processes, the problematic cpu ranges would change as well. Using relative (+) or absolute ranges did not save the day either: adding or removing a + sign on some of the ranges would simply move the error message to another range. Very nonsensical.

Switching from --physcpubind= to --cpunodebind= in the 2nd case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, since it binds too many processes to the same socket instead of to different core groups. If one does not use membind, a solution with actual core ranges can be devised with taskset:
if [ -z "${MEM}" ]; then
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"
taskset -c ${CPU} ${XHPL} ${DAT}
else
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"
numactl --cpunodebind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi
Still I’d like a solution that allows me to specify cpu ranges and membind instead of relying on first touch policies.
Notice that for the taskset case one would use something like:
--cpu-affinity 16-23:24-31:48-55:56-63:80-87:88-95:112-119:120-127 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5
while for the numactl case:
--cpu-affinity 1:1:3:3:5:5:7:7 --mem-affinity 1:1:3:3:5:5:7:7 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5
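On the earlier wish to combine explicit core ranges with membind: one possibility, shown only as an untested sketch, is chaining the two tools, since numactl --membind only sets the memory policy and leaves the cpu affinity installed by taskset intact. The values below are example per-rank settings from this thread; the command is echoed rather than executed, since xhpl is not present here.

```shell
#!/bin/sh
# Hedged sketch: pin cores with taskset, then bind memory with numactl.
CPU="16-23"       # explicit core range for this rank
MEM="1"           # NUMA node for memory binding
XHPL="./xhpl"     # benchmark binary (placeholder path)
DAT="./HPL.dat"

# echo the composed command line instead of running it:
echo taskset -c "${CPU}" numactl --membind="${MEM}" "${XHPL}" "${DAT}"
```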
I had the same observation with Cascade Lake as well.
Could you elaborate on this assessment?
Switching from --physcpubind= to --cpunodebind= in the 2nd case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, since it binds too many processes to the same socket instead of to different core groups
The original 21.4 script's --membind caused the index error. It seems that we cannot mix a cpu-range based --physcpubind with a NUMA-index based --membind, at least on EPYC. I will adopt your approach since it allows better memory management.
I checked process placement during execution, and the processes were indeed allocated on the correct NUMA node. As long as there is a 1:1 ratio between CPUs and GPUs, there should be no issue with over-subscription.
Am I understanding your concern correctly?
Would it be simpler to just replace the original with the 'else' clause, i.e.: