DGX A100 with Mellanox MQM8700 switches setup

Hi All,

We have a setup with 2 MQM8700 switches and 2 DGX A100 nodes.
We are looking for a cabling guide between the DGX nodes and the Mellanox switches, a configuration guide for the switches, and guidance on setting up the DGX nodes as a cluster.
If anyone has a detailed document covering this, kindly share it.

Hi @psatish69 !

I’d recommend following the core of one of the DGX POD Reference Architectures, as described in the papers linked at the bottom of the “DGX POD: The Industry Standard for AI at Scale” page on the NVIDIA site.

Using NetApp ONTAP AI as an example, the paper “NetApp ONTAP AI with NVIDIA DGX A100 Systems” illustrates how to build an InfiniBand compute fabric with the single-port HCAs on the DGX A100 and the pair of QM8700 switches. (It also has a lot of detail about the storage fabric, which you didn’t ask about, but is often equally important!)

ScottE

Thanks Scott.

Our setup is just 2 Mellanox switches with 2 DGX nodes.
The user wants a cluster set up across the DGX nodes, with their applications running in containers.
We are looking for documents on how to configure the switches, and on which solution would be best for setting up a cluster on the DGX nodes (Kubernetes, Ansible).

Also, is there a detailed document on how to configure the Mellanox switches, what port-level configuration is needed for the DGX nodes, and whether there is any cluster setup involved?

There’s not really anything to configure on the switches for a setup like yours. Connect the cables from the DGXs to the switches (e.g., even-numbered single-port ConnectX-6 ports to one switch, odd-numbered ports to the other switch). The switches should be automatically running a subnet manager, and things will “just work”.
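
To make the even/odd split concrete, here is a minimal sketch that prints a cabling plan. The host names, switch names, and the count of 8 compute ports per DGX are placeholders I've assumed for illustration, so adjust them to whatever you actually have.

```python
# Sketch of the even/odd cabling split described above.
# Host names, switch names, and the port count are assumptions, not taken from any NVIDIA guide.

def cabling_plan(hosts, ports_per_host, switches=("switch-a", "switch-b")):
    """Return (host, port_index, switch) tuples: even ports go to the first switch, odd to the second."""
    plan = []
    for host in hosts:
        for port in range(ports_per_host):
            plan.append((host, port, switches[port % 2]))
    return plan


if __name__ == "__main__":
    # Assumed: two DGX systems with 8 single-port compute HCAs each.
    for host, port, switch in cabling_plan(["dgx-1", "dgx-2"], 8):
        print(f"{host} port {port} -> {switch}")
```

With these assumed numbers, each switch ends up with 4 links from each DGX, so both systems are reachable through either switch.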

After cabling them up, run ibnetdiscover (see Diagnostic Tools - UFM-SDN Appliance UM v4.4.0 - NVIDIA Networking Docs) to check that you can see all the ports on both DGX A100 systems.
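
If you'd rather not eyeball the output, a rough sketch like the one below can count the switches and HCAs that ibnetdiscover reports. It assumes the usual text layout, where each node starts with a line such as `Switch 40 "S-..." # "description"` or `Ca 1 "H-..." # "description"`. The sample text, GUIDs, and node names are made up for illustration, so check the pattern against your real output.

```python
import re
import subprocess

# Assumed layout of a node header line in ibnetdiscover output:
#   Switch  40 "S-<guid>"   # "<node description>" ...
#   Ca      1  "H-<guid>"   # "<node description>"
NODE_RE = re.compile(r'^(Switch|Ca)\s+(\d+)\s+"([^"]+)"\s+#\s+"([^"]*)"', re.MULTILINE)

# Made-up sample, only to show the shape of the data being parsed.
SAMPLE = '''\
Switch  40 "S-0000000000000001"    # "switch-a" enhanced port 0 lid 1 lmc 0
[1]  "H-0000000000000011"[1](11)    # "dgx-1 mlx5_0" lid 3 4xHDR
Ca  1 "H-0000000000000011"    # "dgx-1 mlx5_0"
[1](11)  "S-0000000000000001"[1]    # lid 3 lmc 0 "switch-a" lid 1 4xHDR
'''


def summarize_fabric(text):
    """Group node descriptions by node type (Switch or Ca)."""
    summary = {"Switch": [], "Ca": []}
    for kind, _ports, _node_id, description in NODE_RE.findall(text):
        summary[kind].append(description)
    return summary


def run_ibnetdiscover():
    """Run ibnetdiscover on this host (usually needs root) and return its text output."""
    result = subprocess.run(["ibnetdiscover"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Parses the made-up sample; swap in run_ibnetdiscover() on a cabled DGX.
    summary = summarize_fabric(SAMPLE)
    print(f"switches: {len(summary['Switch'])}, HCAs: {len(summary['Ca'])}")
    for description in summary["Ca"]:
        print("  HCA:", description)
```

On a real system, pass `run_ibnetdiscover()` to `summarize_fabric` and compare the HCA count with the number of ports you cabled.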

ScottE