'nvidia-smi topo -m' revisited

Following https://forums.developer.nvidia.com/t/how-to-change-cpu-affinity-in-nvidia-smi-topo/190990 and https://stackoverflow.com/questions/55364149/understanding-nvidia-smi-topo-m-output (especially the awesome figure) I am trying to make sense of my output from ‘nvidia-smi topo -m’.

        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-3,7-9,13-15   0
GPU1    NV2      X      NV2     NV2     SYS     0-3,7-9,13-15   0
GPU2    NV2     NV2      X      NV2     NODE    24-27,31-33     2
GPU3    NV2     NV2     NV2      X      NODE    24-27,31-33     2
mlx5_0  SYS     SYS     NODE    NODE     X

This is the output from one of our Volta nodes.
I understand that this is 4 GPUs connected by NVLink across 2 NUMA nodes.
It is the CPU Affinity column I am trying to get to grips with.
In previous years I had a script parsing this output to assign a CPU “controller” to each GPU (which I guess I can still do), but the topology seemed more intuitive in those days: it was obvious which CPU was closest to each GPU, because the numbering was consecutive or regularly strided. The CPU Affinity column above feels unintuitive, especially as the node has 48 CPUs.
Can you explain the nvidia-smi output and advise on how best to match CPUs to GPUs when a CPU acts only as the controller for a GPU and the remaining CPUs do other tasks in the background?

If you have a process that is going to access GPU0 or GPU1, then use something like:

taskset -c x ./my_executable

where x is one of 0, 1, 2, 3, 7, 8, 9, 13, 14 or 15, to place the execution of my_executable on a CPU core that has an “affinity relationship” to GPUs 0 and 1. That’s pretty much all you need to know for basic process placement.
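If you would rather do the same placement from inside a script, here is a minimal Python sketch. It assumes Linux (`os.sched_setaffinity` is Linux-only), and the helper names `parse_cpu_list` and `pin_to_one_cpu` are made up for illustration. The affinity string is simply copied from the CPU Affinity column above.

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as "0-3,7-9,13-15" into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_one_cpu(spec, index=0):
    """Pin the calling process to one CPU picked from the list (Linux only)."""
    cpu = parse_cpu_list(spec)[index]
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return cpu


# CPU Affinity reported for GPU0 and GPU1 in the output above.
GPU01_AFFINITY = "0-3,7-9,13-15"
print(parse_cpu_list(GPU01_AFFINITY))
```

Calling `pin_to_one_cpu(GPU01_AFFINITY)` at the start of a process, before it touches the GPU, has roughly the same effect as launching it under `taskset -c 0`.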

For additional observations:

This system probably has 2 NUMA nodes per socket (if it is a 2-socket system) or 4 NUMA nodes per socket (if it is a 1-socket system). AMD CPUs are often configured like this. They will have PCIe lanes (connected to the GPUs) that are “closer” to CPU cores that are associated with particular NUMA nodes. What this means is that some CPU cores don’t have an affinity relationship to any GPU, so they don’t appear in the list, and your list may not include all 48 cores.

I can’t explain the core numbering exactly. That’s all I would bother to say/guess at without the CPU information (OEM system type, number of sockets, actual part numbers, etc.). If you want the best insight into what’s going on, it’s important to have this info as well. The nvidia-smi topo -m output doesn’t contain all the information that might be interesting.

Wow, thanks for the speedy reply.

Normal priority queue, nodes equipped with NVIDIA Volta GPUs, 160 nodes total
2 x 24-core Intel Xeon Platinum 8268 (Cascade Lake) 2.9 GHz CPUs per node
384 GB RAM per node
2 CPU sockets per node, each with 2 NUMA nodes
12 CPU cores per NUMA node
96 GB local RAM per NUMA node
4 x Nvidia Tesla Volta V100-SXM2-32GB per node
480 GB local SSD disk per node 
Max request of 960 CPU cores (80 GPUs)

but I think you have answered all I was puzzled about. To give you context, I’m working with Gaussian (Kyle J can vouch for me) and nvidia-smi is still part of the instructions https://gaussian.com/gpu/. I probably still have the information I need. I just understand the other core numbering stuff better now so thank you.

Just for reference, the NUMA topology you’re seeing here is because sub-NUMA clustering is enabled. At least on this hardware, I’m pretty sure that makes the platform expose the internal topology of the cores within the physical CPU package to the OS, hence the multiple discontiguous ranges. Each of the CPU packages presents its own, potentially unique arrangement.

However, the output of nvidia-smi can be a bit misleading: here’s the topology of an example node:

[bjm900@gadi-gpu-v100-0001 ~]$ lscpu | grep NUMA
NUMA node(s):        4
NUMA node0 CPU(s):   0-3,7-9,13-15,19,20,48-51,55-57,61-63,67,68
NUMA node1 CPU(s):   4-6,10-12,16-18,21-23,52-54,58-60,64-66,69-71
NUMA node2 CPU(s):   24-27,31-33,37-39,43,44,72-75,79-81,85-87,91,92
NUMA node3 CPU(s):   28-30,34-36,40-42,45-47,76-78,82-84,88-90,93-95

(HyperThreading is also enabled here, so ignore “cores” 48-95.) And here’s the output of nvidia-smi:

[bjm900@gadi-gpu-v100-0001 ~]$ nvidia-smi topo -m | head -n6
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-3,7-9,13-15   0
GPU1    NV2      X      NV2     NV2     SYS     0-3,7-9,13-15   0
GPU2    NV2     NV2      X      NV2     NODE    24-27,31-33     2
GPU3    NV2     NV2     NV2      X      NODE    24-27,31-33     2
mlx5_0  SYS     SYS     NODE    NODE     X

The CPU affinity column looks to have been silently truncated. But I’m guessing that’s just a limitation on the output?
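One possible workaround is to ignore the CPU Affinity column, take only the NUMA Affinity value for each GPU, and look up the full core list for that node in the lscpu output. The Python sketch below does that on the text pasted above. It assumes the column layout shown in this thread (NUMA Affinity as the last field of each GPU row) and that IDs 48 and up are the HyperThreading siblings on this node. The function names are made up for illustration.

```python
import re

LSCPU = """\
NUMA node(s):        4
NUMA node0 CPU(s):   0-3,7-9,13-15,19,20,48-51,55-57,61-63,67,68
NUMA node1 CPU(s):   4-6,10-12,16-18,21-23,52-54,58-60,64-66,69-71
NUMA node2 CPU(s):   24-27,31-33,37-39,43,44,72-75,79-81,85-87,91,92
NUMA node3 CPU(s):   28-30,34-36,40-42,45-47,76-78,82-84,88-90,93-95
"""

TOPO = """\
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-3,7-9,13-15   0
GPU1    NV2      X      NV2     NV2     SYS     0-3,7-9,13-15   0
GPU2    NV2     NV2      X      NV2     NODE    24-27,31-33     2
GPU3    NV2     NV2     NV2      X      NODE    24-27,31-33     2
mlx5_0  SYS     SYS     NODE    NODE     X
"""


def parse_cpu_list(spec):
    """Expand "0-3,7-9" style CPU lists into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        elif part.strip():
            cpus.add(int(part))
    return sorted(cpus)


def numa_cpus_from_lscpu(text):
    """Map NUMA node id -> CPU ids, from 'NUMA nodeN CPU(s): ...' lines."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"NUMA node(\d+) CPU\(s\):\s*(\S+)", line.strip())
        if m:
            nodes[int(m.group(1))] = parse_cpu_list(m.group(2))
    return nodes


def gpu_numa_from_topo(text):
    """Map GPU name -> NUMA node, assuming the node id is the last field."""
    gpus = {}
    for line in text.splitlines():
        fields = line.split()
        # The header row also starts with "GPU0" but ends in "Affinity".
        if fields and re.fullmatch(r"GPU\d+", fields[0]) and fields[-1].isdigit():
            gpus[fields[0]] = int(fields[-1])
    return gpus


def cores_per_gpu(topo_text, lscpu_text, physical_cores):
    """Full list of physical cores in each GPU's NUMA node."""
    numa = numa_cpus_from_lscpu(lscpu_text)
    return {
        gpu: [c for c in numa[node] if c < physical_cores]
        for gpu, node in gpu_numa_from_topo(topo_text).items()
    }


for gpu, cores in cores_per_gpu(TOPO, LSCPU, physical_cores=48).items():
    print(gpu, cores)
```

On this sample, each GPU ends up with the 12 physical cores of its NUMA node, so GPU0 and GPU1 also get cores 19 and 20, which the nvidia-smi column leaves out.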

Lol @ben.menadue you kind of said that on our Slack but I didn’t clock what you meant by sub-NUMA clustering and didn’t realise how it explained the CPU numbering. Not bothered about the hyperthreading. What puzzled me, amongst other things, was where the other 32 CPUs had got to and what they were doing. lscpu is a cool command (I was catting cpuinfo). Anyway, as Robert reinforced what you said I’m happy. Thank you both.
