I installed the network drivers. Now what?

I am trying to set up InfiniBand networking on our HGX GPU cluster, and have installed the doca-all package on two of the machines. I think the installation went fine, but I am struggling to figure out what to do next. How do I actually test whether InfiniBand works?

For additional context, this is the output of ifconfig on one of the machines:

ceti@ceti5:/opt/mellanox/doca$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:8b:e4:da:fa  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp83s0f0np0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether a0:88:c2:39:1c:90  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp83s0f1np1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether a0:88:c2:39:1c:91  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp86s0f0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 7c:c2:55:7b:43:74  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp86s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.69.75  netmask 255.255.255.0  broadcast 192.168.69.255
        inet6 fe80::7ec2:55ff:fe7b:4375  prefixlen 64  scopeid 0x20<link>
        ether 7c:c2:55:7b:43:75  txqueuelen 1000  (Ethernet)
        RX packets 97630  bytes 7988902 (7.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2528  bytes 269280 (269.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1014  bytes 279266 (279.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1014  bytes 279266 (279.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

When I asked Perplexity, it suggested a ping, but that would just end up going over the Ethernet port. I have no idea what these other interfaces are for, or how to test InfiniBand.

ceti@ceti5:/opt/mellanox/doca$ lspci | grep -i infi
19:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
29:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
3b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5c:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
9b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
aa:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
bb:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
da:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]

So the InfiniBand controllers are detected.

ceti@ceti5:/opt/mellanox/doca$ mlxconfig q
-E- No devices found, mst might be stopped. You may need to run 'mst start' to load MST modules.
ceti@ceti5:/opt/mellanox/doca$ sudo mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4125_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:53:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf0         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:19:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf1         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:29:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf2         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:3b:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf3         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:5c:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf4         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:9b:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf5         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:aa:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf6         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:bb:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00
/dev/mst/mt4129_pciconf7         - PCI configuration cycles access.
                                   domain:bus:dev.fn=0000:da:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                   Chip revision is: 00

I already ran sudo mst start and it didn’t help. Why is mlxconfig not finding any devices?
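
To cross-check the lspci and mst status listings above, here is a minimal Python sketch that pulls the PCI addresses out of pasted text and compares the two sets. The sample strings are abbreviated copies of the output above, and the helper names (lspci_addresses, mst_addresses) are made up for illustration.

```python
import re

# Abbreviated copy of the `lspci | grep -i infi` output above.
LSPCI_SAMPLE = """\
19:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
29:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
3b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
5c:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
9b:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
aa:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
bb:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
da:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
"""

# Only the "domain:bus:dev.fn" lines from the `sudo mst status` output above.
MST_SAMPLE = """\
domain:bus:dev.fn=0000:53:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:19:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:29:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:3b:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:5c:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:9b:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:aa:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:bb:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
domain:bus:dev.fn=0000:da:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
"""

PCI_ADDR = r"[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]"


def lspci_addresses(text):
    """Return the bus:dev.fn address at the start of each lspci line."""
    return set(re.findall(rf"^({PCI_ADDR})\s", text, flags=re.MULTILINE))


def mst_addresses(text):
    """Return bus:dev.fn addresses from mst status lines, dropping the domain."""
    return set(re.findall(rf"domain:bus:dev\.fn=[0-9a-f]{{4}}:({PCI_ADDR})", text))


lspci = lspci_addresses(LSPCI_SAMPLE)
mst = mst_addresses(MST_SAMPLE)

print("in both listings:  ", sorted(lspci & mst))
print("only in mst status:", sorted(mst - lspci))
print("only in lspci:     ", sorted(lspci - mst))
```

On the output above, all eight ConnectX-7 addresses show up in both listings. The only extra entry under mst status is 53:00.0 (the mt4125 device), presumably because my lspci command was filtered with grep -i infi.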

ceti@ceti5:/opt/mellanox/doca$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp86s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 7c:c2:55:7b:43:74 brd ff:ff:ff:ff:ff:ff
3: usb0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether fe:f8:a6:6d:9f:1e brd ff:ff:ff:ff:ff:ff
4: enp86s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:c2:55:7b:43:75 brd ff:ff:ff:ff:ff:ff
    inet 192.168.69.75/24 metric 100 brd 192.168.69.255 scope global dynamic enp86s0f1
       valid_lft 50079sec preferred_lft 50079sec
    inet6 fe80::7ec2:55ff:fe7b:4375/64 scope link
       valid_lft forever preferred_lft forever
15: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8b:e4:da:fa brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
16: enp83s0f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:88:c2:39:1c:90 brd ff:ff:ff:ff:ff:ff
17: enp83s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether a0:88:c2:39:1c:91 brd ff:ff:ff:ff:ff:ff
18: ibp25s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:0c:a1:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ee:fe:84 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
19: ibp41s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ed:42:d6 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
20: ibp59s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ed:42:fe brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
21: ibp92s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ee:f5:34 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
22: ibp155s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ec:f6:3e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
23: ibp170s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ed:63:06 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
24: ibp187s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ec:f7:7e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
25: ibp218s0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1000
    link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ed:3d:d6 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
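
It might also help to match these interface names to the PCI addresses from lspci. As far as I understand the systemd predictable naming scheme, the number after "p" is the PCI bus in decimal and the number after "s" is the slot, so ibp25s0 would sit on bus 0x19. Below is a minimal Python sketch of that conversion. The function name pci_address_from_ifname is my own, and the pattern only covers the name shapes that appear in the output above.

```python
import re

# Matches names such as ibp25s0, enp86s0f1 and enp83s0f0np0:
# prefix, "p" + decimal bus, "s" + decimal slot, optional "f" + function,
# optional "n..." port suffix. Other name shapes are not handled.
_IFNAME_RE = re.compile(r"^(?:en|ib)p(\d+)s(\d+)(?:f(\d+))?(?:n\w+)?$")


def pci_address_from_ifname(name):
    """Convert a predictable interface name to a bus:slot.function string.

    Returns None for names that do not follow the pattern (lo, docker0, usb0).
    """
    match = _IFNAME_RE.match(name)
    if match is None:
        return None
    bus = int(match.group(1))
    slot = int(match.group(2))
    function = int(match.group(3) or 0)
    # lspci prints bus and slot as two hex digits and the function as one.
    return f"{bus:02x}:{slot:02x}.{function:x}"


# Interface names taken from the `ip addr show` output above.
INTERFACES = [
    "lo", "enp86s0f0", "usb0", "enp86s0f1", "docker0",
    "enp83s0f0np0", "enp83s0f1np1",
    "ibp25s0", "ibp41s0", "ibp59s0", "ibp92s0",
    "ibp155s0", "ibp170s0", "ibp187s0", "ibp218s0",
]

for ifname in INTERFACES:
    print(f"{ifname:14s} -> {pci_address_from_ifname(ifname)}")
```

If that reading of the naming scheme is right, the eight ib* interfaces map to 19:00.0, 29:00.0, 3b:00.0, 5c:00.0, 9b:00.0, aa:00.0, bb:00.0 and da:00.0, which are exactly the eight ConnectX-7 controllers in the lspci output, and enp83s0f0np0 maps to 53:00.0, the mt4125 device from mst status.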

I think I need to configure all of the InfiniBand interfaces so that they have IP addresses and then bring them up, right? But wouldn’t it be awkward to have to assign an IP to each of these? Would it be possible to assign a single IP to all the interfaces?