Following https://forums.developer.nvidia.com/t/how-to-change-cpu-affinity-in-nvidia-smi-topo/190990 and https://stackoverflow.com/questions/55364149/understanding-nvidia-smi-topo-m-output (especially the awesome figure), I am trying to make sense of my output from ‘nvidia-smi topo -m’:
        GPU0  GPU1  GPU2  GPU3  mlx5_0  CPU Affinity    NUMA Affinity
GPU0    X     NV2   NV2   NV2   SYS     0-3,7-9,13-15   0
GPU1    NV2   X     NV2   NV2   SYS     0-3,7-9,13-15   0
GPU2    NV2   NV2   X     NV2   NODE    24-27,31-33     2
GPU3    NV2   NV2   NV2   X     NODE    24-27,31-33     2
mlx5_0  SYS   SYS   NODE  NODE  X
This is the output from one of our Volta nodes.
I understand that this is 4 GPUs connected by NVLink across 2 NUMA nodes.
It is the CPU Affinity column I am trying to get to grips with.
In previous years I had a script parsing this output to assign a CPU “controller” to each GPU (which I guess I can still do), but in those days it was more intuitive to tell from the topology which CPU was closest to each GPU, because the core numbers were consecutive or numerically strided. The CPU Affinity column above feels unintuitive, especially as the node has 48 CPUs.
Can you explain this nvidia-smi output and advise on the best way to match CPUs to GPUs when one CPU per GPU acts only as a controller and the remaining CPUs do other tasks in the background?
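For concreteness, here is a minimal sketch of what such a controller-assignment script could look like. It is an illustration only: the function names `parse_cpu_list` and `assign_controllers` are hypothetical, the affinity strings are copied by hand from the CPU Affinity column above rather than read from nvidia-smi, and "first unused core in the list" is an assumed policy, not a recommendation from NVIDIA.

```python
def parse_cpu_list(spec):
    """Expand a range list such as '0-3,7-9,13-15' into individual core IDs."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def assign_controllers(affinity):
    """Pick one distinct controller core per GPU from that GPU's affinity list.

    Assumed policy: take the first core in the list that no other GPU
    has already claimed.
    """
    used = set()
    chosen = {}
    for gpu, spec in affinity.items():
        for cpu in parse_cpu_list(spec):
            if cpu not in used:
                used.add(cpu)
                chosen[gpu] = cpu
                break
        else:
            raise RuntimeError(f"no free core left in affinity list for {gpu}")
    return chosen


# Values copied from the CPU Affinity column of the output above.
affinity = {
    "GPU0": "0-3,7-9,13-15",
    "GPU1": "0-3,7-9,13-15",
    "GPU2": "24-27,31-33",
    "GPU3": "24-27,31-33",
}

controllers = assign_controllers(affinity)
print(controllers)
# {'GPU0': 0, 'GPU1': 1, 'GPU2': 24, 'GPU3': 25}
```

On Linux, each controller process could then pin itself with `os.sched_setaffinity(0, {cpu})`, leaving the unclaimed cores for the background tasks. Whether the first core in each list is actually the best choice is exactly the part I am unsure about.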