Binding each GPU to its own CPU cores

Hello,

I am running the computational fluid dynamics software PyFR on the CUDA backend. Each node of my cluster has 8 A100 GPUs and two CPUs. What I want to do is bind each GPU (i.e. each MPI rank) to its own set of CPU cores to get the best performance. After some googling, this is the procedure I use on a single node:

  1. Run nvidia-smi topo -m to get the topology of my machine, which looks like this:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_2  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_3  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS             
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS             
mlx5_5  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS             
mlx5_6  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS             
mlx5_7  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS             
mlx5_8  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX             
mlx5_9  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X              
  2. After getting the CPU core ranges for each GPU, I wrote a wrapper script called script_gpu that uses taskset (a sketch of a way to automate this mapping is shown after the list):
#!/bin/bash
# One GPU per MPI rank: Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK for each local process
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK

# Pin each rank to the cores of its GPU's NUMA node (taken from nvidia-smi topo -m)
case $CUDA_VISIBLE_DEVICES in
        0|1) CORES=48-63   ;;
        2|3) CORES=16-31   ;;
        4|5) CORES=112-127 ;;
        6|7) CORES=80-95   ;;
esac

#echo $CUDA_VISIBLE_DEVICES $CORES
exec taskset -c $CORES "$@"
  3. Run my job with mpirun:
mpirun -np 8 ./script_gpu pyfr ..........
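
For reference, I think the CPU Affinity column does not have to be copied out of nvidia-smi topo -m by hand; the same mapping can probably be read from sysfs. This is only a sketch I put together (it assumes PCI domain 0000, a standard /sys/bus/pci layout, and that nvidia-smi supports --query-gpu=index,pci.bus_id, which mine does):

#!/bin/bash
# Sketch: print each GPU's local CPU list straight from sysfs instead of hard-coding it
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader |
while IFS=, read -r idx busid; do
        # nvidia-smi prints e.g. " 00000000:C1:00.0"; sysfs expects "0000:c1:00.0"
        busid=$(echo "$busid" | tr -d ' ' | tr 'A-F' 'a-f')
        busid=${busid: -12}
        echo "GPU $idx -> cores $(cat /sys/bus/pci/devices/$busid/local_cpulist)"
done

As a quick check that the binding from script_gpu actually takes effect, I run something like mpirun -np 8 ./script_gpu bash -c 'grep Cpus_allowed_list /proc/self/status' and compare the masks against the CPU Affinity column above.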

This works well. However, I need to run this on more than one node (e.g. two nodes). I changed my command to srun -n 16 --mpi=pmix --cpu-bind=none ./script_gpu pyfr .... but this does not work. srun without the wrapper script does work, but not with the best performance. Following some discussions I found online I also tried numactl instead of taskset, but that does not work either. Here is a record of what I have tried so far (a sketch of the srun wrapper variant I have in mind is shown after the table):

command                                                                         result / error
mpirun -np 8 with taskset -c $CORES "$@"                                        working
mpirun -np 8 with numactl --physcpubind=$CORES "$@"                             libnuma: Warning: cpu argument 48-63 is out of range
mpirun -np 8 with numactl --cpunodebind=$CORES "$@"                             libnuma: Warning: node argument 48 is out of range

srun -n 8 --cpu-bind=none --mpi=pmix with taskset -c $CORES "$@"                taskset: failed to parse CPU list: pyfr
srun -n 8 --cpu-bind=none --mpi=pmix with numactl --physcpubind=$CORES "$@"     sched_setaffinity: Invalid argument
srun -n 8 --cpu-bind=none --mpi=pmix with numactl --cpunodebind=$CORES "$@"     numa_sched_setaffinity_v2_int() failed: Invalid argument; sched_setaffinity: Invalid argument
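
For completeness, here is the wrapper variant I have in mind for srun. It is only a sketch, based on my (unconfirmed) guess that srun sets SLURM_LOCALID for each task rather than OMPI_COMM_WORLD_LOCAL_RANK, so under srun the original script would end up with an empty $CORES, which would explain the "taskset: failed to parse CPU list: pyfr" error:

#!/bin/bash
# Sketch of a launcher-agnostic wrapper: use whichever local-rank variable is set.
# mpirun (Open MPI) sets OMPI_COMM_WORLD_LOCAL_RANK; srun sets SLURM_LOCALID.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK

# Same GPU -> core mapping as before (from nvidia-smi topo -m)
case $LOCAL_RANK in
        0|1) CORES=48-63   ;;
        2|3) CORES=16-31   ;;
        4|5) CORES=112-127 ;;
        6|7) CORES=80-95   ;;
esac

exec taskset -c $CORES "$@"

I would still pass --cpu-bind=none to srun so that Slurm does not restrict the cores before taskset runs, but I am not sure whether that is enough, which is why I am asking here.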

Does anyone have any comments on this? I am a student in fluid dynamics, so I am not very familiar with these tools. Could anyone give me a more detailed explanation? I really appreciate any answers.

Best wishes,
Zhenyang