This is my first time posting here. I would appreciate it if you could provide some insight into the following issue.
I’m encountering a peculiar error related to numactl with the NVIDIA HPC-Benchmarks 21.4.
[System Information]
OS: CentOS Linux release 7.9.2009
CPU: AMD EPYC 7742 (64-core)
GPU: 8 x A100-SMX4 (HGX)
Driver: 495.29.05
<1> is invalid
libnuma: Warning: cpu argument 1 is out of range
Therefore I can only run the benchmark with 1 A100 GPU.
The CPU-GPU affinity is somewhat awkward, since it may be better to assign one GPU per NUMA domain.
This problem was not observed with the 20.10 release, where monitoring with htop confirmed that the MPI processes had been created in their designated NUMA domains.
Could you give some suggestions on how to solve this problem?
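For what it's worth, the libnuma warning suggests numactl was handed a cpu (or node) index outside what libnuma sees on the system. A minimal sketch of sanity-checking a "lo-hi" cpu range against the visible cpu count before passing it to numactl (the helper name `range_ok` is invented, not part of the benchmark scripts):

```shell
#!/bin/sh
# Illustrative helper: succeed only if every cpu in a "lo-hi" range
# (or a single cpu number) exists on a node with $2 cpus total.
range_ok() {
    range=$1
    ncpus=$2              # e.g. $(nproc --all)
    hi=${range#*-}        # "112-119" -> "119"; a bare "5" stays "5"
    [ "${hi}" -lt "${ncpus}" ]
}

range_ok "112-119" 128 && echo "ok" || echo "out of range"
```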
if [ -z "${MEM}" ]; then
    info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"
    numactl --cpunodebind=${CPU} ${XHPL} ${DAT}
else
    info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"
    numactl --physcpubind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi
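For context, the per-rank ${CPU} and ${MEM} values above come from colon-separated affinity lists passed to the launcher (as in the `--cpu-affinity` examples later in this thread). A minimal sketch of selecting the LOCAL_RANK-th field with cut (the function name `pick_affinity` is invented, not from the container image):

```shell
#!/bin/sh
# Illustrative: pick the entry for local rank $2 (0-based) from a
# colon-separated affinity list such as "16-23:24-31:48-55:56-63".
pick_affinity() {
    echo "$1" | cut -d: -f$(($2 + 1))
}

CPU=$(pick_affinity "16-23:24-31:48-55:56-63" 2)
echo "cpu=${CPU}"   # cpu=48-55
```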
Please extract the contents of the 21.4 image to a folder, modify the hpl.sh script, and then rebuild the image.
I guess there might be an issue with memory binding, but I didn't look too deeply.
I hope it helps.
Thanks @vitduck, it helped A LOT! I was able to run the benchmark properly after extracting the files from the container and adapting the hpl.sh file with your settings.
I ran into the same error, in more ways than one. On an Inspur system with Cascade Lake CPUs and a more regular NUMA structure, the scripts work when ranges of cpus are provided, but I need to prefix every cpu range with + for relative numbering, otherwise the script complains. On an HGX system like this one, the same approach complains about some of the cpu ranges but not all of them! Moreover, if I changed the number of MPI processes, the problematic cpu ranges would change as well. Using relative (+) or absolute ranges did not save the day either: adding or removing a + sign on some of the ranges would simply move the error message to another range. Very nonsensical.

Switching from --physcpubind= to --cpunodebind= in the 2nd case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, since it binds too many processes to the same socket instead of to different core groups. If one does not use membind, a solution with actual core ranges can be devised with taskset:
if [ -z "${MEM}" ]; then
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"
taskset -c ${CPU} ${XHPL} ${DAT}
else
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"
numactl --cpunodebind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi
Still I’d like a solution that allows me to specify cpu ranges and membind instead of relying on first touch policies.
Notice that for the taskset case one would use something like:
--cpu-affinity 16-23:24-31:48-55:56-63:80-87:88-95:112-119:120-127 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5
while for the numactl case:
--cpu-affinity 1:1:3:3:5:5:7:7 --mem-affinity 1:1:3:3:5:5:7:7 --cpu-cores-per-rank 8 --gpu-affinity 2:3:0:1:6:7:4:5
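On the earlier wish to combine explicit core ranges with membind: one possibility, shown only as an untested sketch, is chaining the two tools, since numactl --membind only sets the memory policy and leaves the cpu affinity installed by taskset intact. The values below are example per-rank settings from this thread; the command is echoed rather than executed, since xhpl is not present here.

```shell
#!/bin/sh
# Hedged sketch: pin cores with taskset, then bind memory with numactl.
CPU="16-23"       # explicit core range for this rank
MEM="1"           # NUMA node for memory binding
XHPL="./xhpl"     # benchmark binary (placeholder path)
DAT="./HPL.dat"

# echo the composed command line instead of running it:
echo taskset -c "${CPU}" numactl --membind="${MEM}" "${XHPL}" "${DAT}"
```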
I had the same observation with Cascade Lake as well.
Could you elaborate on this assessment?
Switching from --physcpubind= to --cpunodebind= in the 2nd case (with membind) and using NUMA domains for the CPU binding fixes the problem, but it is suboptimal, since it binds too many processes to the same socket instead of to different core groups
The original 21.4 script's --membind caused the index error. It seems that we cannot mix a cpu-range based --physcpubind with a NUMA-index based --membind, at least on EPYC. I will adopt your approach since it allows better memory management.
I checked process placement during execution, and the processes were indeed allocated on the correct NUMA node. As long as there is a 1:1 ratio between CPUs and GPUs, there should be no issue with over-subscription.
Am I understanding your concern correctly?
Would it be simpler to just replace the original with the 'else' clause, i.e.: