Hello,
I am running the computational fluid dynamics code PyFR with its CUDA backend. Each node of my cluster has 8 A100 GPUs and two CPUs. What I want to do is bind each MPI rank to the CPU cores local to its GPU to get the best performance. After some googling, here is the procedure I use on a single node:
- run nvidia-smi topo -m to get the topology of my machine, which looks like this:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
mlx5_3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
mlx5_5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
mlx5_6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
mlx5_7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
- after getting the CPU affinities, I wrote a wrapper script called script_gpu that uses taskset:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
case $CUDA_VISIBLE_DEVICES in
    0) CORES=48-63 ;;
    1) CORES=48-63 ;;
    2) CORES=16-31 ;;
    3) CORES=16-31 ;;
    4) CORES=112-127 ;;
    5) CORES=112-127 ;;
    6) CORES=80-95 ;;
    7) CORES=80-95 ;;
esac
#echo $CUDA_VISIBLE_DEVICES $CORES
exec taskset -c "$CORES" "$@"
- run my job with MPI:
mpirun -np 8 ./script_gpu pyfr ..........
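For reference, the same wrapper can be written more compactly with a bash array (a sketch; CORE_MAP and LRANK are my own names, and I am assuming srun exports SLURM_LOCALID the way mpirun exports OMPI_COMM_WORLD_LOCAL_RANK):

```shell
#!/bin/bash
# Sketch of the same wrapper using a bash array instead of a case
# statement. CORE_MAP and LRANK are my own (hypothetical) names.
# Fall back from the Open MPI local-rank variable to Slurm's.
LRANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$LRANK
# Core range per local GPU, read off the CPU Affinity column above
CORE_MAP=(48-63 48-63 16-31 16-31 112-127 112-127 80-95 80-95)
CORES=${CORE_MAP[$LRANK]}
# Launch the real program only when a command was given
if [ $# -gt 0 ]; then
    exec taskset -c "$CORES" "$@"
fi
```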
This works well. However, I now need to run on more nodes (i.e. two nodes). I changed my command to
srun -n 16 --mpi=pmix --cpu-bind=none ./script_gpu pyfr ....
but this does not work. srun without the wrapper script does work, but not with the best performance. Following some discussions online I also tried numactl instead of taskset, but that does not work either. Here is a record of what I have tried so far:
| command | result / errors |
|---|---|
| mpirun -np 8 with taskset -c $CORES $@ | working |
| mpirun -np 8 with numactl --physcpubind=$CORES $@ | libnuma: Warning: cpu argument 48-63 is out of range |
| mpirun -np 8 with numactl --cpunodebind=$CORES $@ | libnuma: Warning: node argument 48 is out of range |
| srun -n 8 --cpu-bind=none --mpi=pmix with taskset -c $CORES $@ | taskset: failed to parse CPU list: pyfr |
| srun -n 8 --cpu-bind=none --mpi=pmix with numactl --physcpubind=$CORES $@ | sched_setaffinity: Invalid argument |
| srun -n 8 --cpu-bind=none --mpi=pmix with numactl --cpunodebind=$CORES $@ | numa_sched_setaffinity_v2_int() failed: Invalid argument; sched_setaffinity: Invalid argument |
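For what it is worth, my understanding from the numactl man page is that --cpunodebind expects NUMA node numbers rather than core lists, which may explain the "node argument 48 is out of range" warning. Using the node numbers from the NUMA Affinity column of the topology above, the call would look something like this sketch (NODE_MAP and LRANK are my own names):

```shell
#!/bin/bash
# NUMA node of each local GPU, read off the NUMA Affinity column of
# nvidia-smi topo -m. NODE_MAP and LRANK are hypothetical names.
NODE_MAP=(3 3 1 1 7 7 5 5)
LRANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
NODE=${NODE_MAP[$LRANK]}
# Bind both CPUs and memory to the GPU's NUMA node when a command is given
if [ $# -gt 0 ]; then
    exec numactl --cpunodebind="$NODE" --membind="$NODE" "$@"
fi
```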
Does anyone have comments on this? I am a student in fluid dynamics, so I am not very familiar with these tools. Could anyone give me a detailed explanation? I would really appreciate any answers.
Best wishes,
Zhenyang