What I did was replace the numactl
command from 21.4 with the ones from 20.4.
[Before]
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} net=${UCX_NET_DEVICES} bin=$XHPL"
numactl --physcpubind=${CPU} ${MEMBIND} ${XHPL} ${DAT}
[After]
if [ -z "${MEM}" ]; then
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} ucx=${UCX_NET_DEVICES} bin=$XHPL"
numactl --cpunodebind=${CPU} ${XHPL} ${DAT}
else
info "host=$(hostname) rank=${RANK} lrank=${LOCAL_RANK} cores=${CPU_CORES_PER_RANK} gpu=${GPU} cpu=${CPU} mem=${MEM} ucx=${UCX_NET_DEVICES} bin=$XHPL"
numactl --physcpubind=${CPU} --membind=${MEM} ${XHPL} ${DAT}
fi
To apply this, dump the contents of the 21.4 image to a folder, modify the hpl.sh
script, and then rebuild the image.
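If the image is a Singularity/Apptainer one, the dump and rebuild steps could look roughly like the sketch below. The container runtime, the three file names, and the helper names (run, unpack_image, repack_image) are all assumptions for illustration. Commands are only printed unless DRY_RUN=0 is set, and the builds may need sudo or --fakeroot depending on your setup.

```shell
# Sketch only: assumes Singularity/Apptainer; all names below are hypothetical.
SRC_IMAGE="hpc-benchmarks_21.4-hpl.sif"   # hypothetical: your 21.4 image file
SANDBOX="hpl-21.4-sandbox"                # hypothetical: folder to unpack into
NEW_IMAGE="hpl-21.4-patched.sif"          # hypothetical: name of the rebuilt image

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Step 1: dump the image content to a writable folder and locate hpl.sh.
unpack_image() {
  run singularity build --sandbox "$SANDBOX" "$SRC_IMAGE"
  run find "$SANDBOX" -name hpl.sh
}

# Step 2 (after editing hpl.sh by hand): rebuild an image from the folder.
repack_image() {
  run singularity build "$NEW_IMAGE" "$SANDBOX"
}
```

Run unpack_image, edit the hpl.sh it finds, then run repack_image. Keeping the two steps separate leaves room for the manual edit in between.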
I guess there might be an issue with memory binding, but I didn’t look too deep.
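To check which binding a rank would get without launching HPL, the branch from the [After] script can be pulled into a small helper that only prints the numactl command. This is a minimal sketch: numactl_cmd is a made-up name and not part of hpl.sh, and the argument values in the two calls are placeholders.

```shell
# numactl_cmd is a hypothetical helper, not part of the original hpl.sh.
# It prints the numactl command line instead of running it.
numactl_cmd() {
  cpu="$1"; mem="$2"; xhpl="$3"; dat="$4"
  if [ -z "$mem" ]; then
    # No memory node given: bind by CPU node only, as in the [After] script.
    echo "numactl --cpunodebind=${cpu} ${xhpl} ${dat}"
  else
    # Memory node given: pin the cores and the memory explicitly.
    echo "numactl --physcpubind=${cpu} --membind=${mem} ${xhpl} ${dat}"
  fi
}

# Placeholder values, just to show both branches.
numactl_cmd 0 "" ./xhpl HPL.dat
numactl_cmd 0-15 0 ./xhpl HPL.dat
```

Comparing the printed line per rank against the output of numactl --hardware on the node should show whether a rank ends up bound to memory on a different NUMA node than its cores.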
I hope it helps.