ConnectX-7 NIC in DGX Spark

I’d like to ask about the port configuration of the ConnectX-7 NIC in DGX Spark.

On our DGX Spark, the onboard ConnectX-7 NIC appears as four ports when checking with ip -br a and lspci:

roceP2p1s0f0 port 1  ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1  ==> enP2p1s0f1np1 (Up)
rocep1s0f0   port 1  ==> enp1s0f0np0    (Down)
rocep1s0f1   port 1  ==> enp1s0f1np1    (Up)

However, the chassis only has two physical ports.

If anyone is familiar with this behavior, could you help explain how the NIC port mapping/logical configuration works for ConnectX-7 on DGX Spark, or point me to any official documentation or technical references that describe this layout?

This is the expected behaviour due to a limitation in the GB10 chip.

The SoC can’t provide more than an x4-wide PCIe link per device, so, to achieve the 200 Gbps speed, we had to use the ConnectX-7’s multi-host mode, aggregating two separate x4-wide PCIe links, which combined can deliver the full 200 Gbps.

As a consequence, the interfaces show up four times, because each root port has to access both interface ports through an x4 link. For maximum speed, you can aggregate all ports, or, for a single cable, aggregate enp1s0f0np0 with enP2p1s0f0np0, for instance, using balance-xor (mode 2).

More information on how to aggregate ports is available here:
NVIDIA Enterprise Support Portal | How to Configure RoCE over LAG (ConnectX-4/ConnectX-5-/ConnectX-6)
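For reference, a minimal sketch of the balance-xor bond described above, using the interface names from this thread (the IP address and hash policy are example choices, not official configuration):

```shell
# Sketch: bond the two PCIe halves of physical port 0 with balance-xor (mode 2).
# Interface names are those shown earlier in this thread; adjust for your system.
ip link add bond0 type bond mode balance-xor xmit_hash_policy layer3+4
ip link set enp1s0f0np0 down
ip link set enP2p1s0f0np0 down
ip link set enp1s0f0np0 master bond0
ip link set enP2p1s0f0np0 master bond0
ip link set bond0 up
# Example address; pick one on the same subnet as the peer Spark.
ip addr add 192.168.100.10/24 dev bond0
```

With balance-xor, each flow hashes to one slave, so multiple concurrent flows are needed to fill both x4 links.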


Is this documentation up to date? Connect Two Sparks | DGX Spark

The mapping is easier to see by listing /sys/class/net:

enp1s0f0np0 → ../../devices/pci0000:00/0000:00:00.0/0000:01:00.0/net/enp1s0f0np0
enp1s0f1np1 → ../../devices/pci0000:00/0000:00:00.0/0000:01:00.1/net/enp1s0f1np1
enP2p1s0f0np0 → ../../devices/pci0002:00/0002:00:00.0/0002:01:00.0/net/enP2p1s0f0np0
enP2p1s0f1np1 → ../../devices/pci0002:00/0002:00:00.0/0002:01:00.1/net/enP2p1s0f1np1

The interface with a P2 in the label is just on a different bus, PCIe bus 2 in this case.
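One way to print the same interface-to-PCIe mapping without reading symlinks by hand is to query each interface’s bus info with ethtool (a generic sketch; it assumes ethtool is installed and the interfaces are named `en*` as above):

```shell
# Print each network interface alongside its PCIe address (domain:bus:dev.fn).
# On DGX Spark, the enP2* interfaces report domain 0002 instead of 0000.
for dev in /sys/class/net/en*; do
    name=$(basename "$dev")
    printf '%-16s %s\n' "$name" "$(ethtool -i "$name" | awk '/bus-info/ {print $2}')"
done
```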


Has anyone been able to get it to work?
I’ve gone through both the automatic and manual setups, and I’m still not connected and still getting errors.
This might be one for a video walkthrough.
If we’re wasting this much time on the basics, using the Sparks for training becomes an inefficient time drain.

If you started with option 1 (auto connect), make sure you remove its address if it happens to differ from the one in the manual setup. After that, the manual setup or the second step-by-step works.

This worked for me. I tried Option 1 (Automatically configure SSH), as described in the Connect Two Sparks playbook.

Hi everyone, my “two Spark” kit just arrived from our partner, and I’ve started to play with it.
ethtool confirms that the DAC link is 200G, yet I’m getting just 98.2 Gbps max when running iperf3.
iperf3 is a powerful tool, but it comes with tons of parameters and tuning options, so I’m probably just missing something; I moved on.
But when I try the “NCCL for Two Sparks“ example, I get these results:

So I just want to check whether these are normal results or whether something is off in my two-kit setup.
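As a side note on the iperf3 number: a single TCP stream often can’t saturate a 200G link, so a common first tuning step is to run several parallel streams (the peer address below is an example, not from the playbook):

```shell
# Single-stream iperf3 typically plateaus well below 200 Gbps on high-speed links.
# -P 8 runs eight parallel streams; -t 30 runs the test for 30 seconds.
iperf3 -c 192.168.100.11 -P 8 -t 30
```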


Can we get some more clarification on this? Is 200Gbps achievable via a single port/single cable connection? If so, based on your post, we have to aggregate enp1s0f0np0 with enP2p1s0f0np0, even though it’s one physical port?

It would be good if you updated documentation and this playbook for clarity, because it says 200 Gbps everywhere without mentioning aggregate or addressing the port/PCIe mapping.

You’ll need to aggregate two of the 100G halves to achieve 200G with a single cable, even though they are a single physical port.

The playbook mentions that two interfaces are displayed for each physical link:

“interface showing as ‘Up’ is enp1s0f1np1 / enP2p1s0f1np1 (each physical port has two names).”

Thanks for the clarification! Can someone update the playbook please?
Because right after this it says:

Please disregard enP2p1s0f0np0 and enP2p1s0f1np1, and use enp1s0f0np0 and enp1s0f1np1 only.

Please clarify this more.

  1. Which interface is the “root port”, enP2p1s0f0np0? If it accesses “both interface ports”, and there are only two interface ports, why does enP2p1s0f1np1 exist? Just one enP2… interface should suffice.
  2. Does “aggregate all ports” mean creating, say, bond0 with all four interfaces, or just the two enP2… ones? Because XOR mode across two interfaces/physical ports transmits in XOR fashion (one port or the other, not both), which should be no different from using a single cable with one physical port on each Spark.

Thanks!


We’re working on that - you should see some updates to this playbook soon.


Any updates on that? Do we need to create bond0, or are the ports already aggregated at the firmware level?

Appears aggregated by default. Sampling with mlnx_perf on both enp1s0f0np0 and enP2p1s0f0np0 should show activity on both paths, even if enP2p1s0f0np0 has no IP.
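A sketch of that sampling, assuming mlnx_perf from MLNX_OFED is installed (interface names from this thread):

```shell
# Sample hardware counters on both PCIe halves of port 0 simultaneously.
# Traffic appearing on both interfaces indicates both x4 paths are in use,
# even though only one interface carries an IP address.
mlnx_perf -i enp1s0f0np0 &
mlnx_perf -i enP2p1s0f0np0
```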


@mnagy009 there are only two NICs attached to the Spark, thus the two sockets. Each NIC is attached to a different PCIe root complex. Here’s lspci -t output:

-[0000:00]---00.0-[01-0f]--+-00.0
                           \-00.1
-[0002:00]---00.0-[01-0f]--+-00.0
                           \-00.1

The first NIC is attached to domain 0000 and the second NIC to domain 0002. When you plug in the cable, both ports of each NIC will show link UP, but you set an IP on one port only for networking.

The first NIC is the left socket, the one next to the 10G port. And both NICs share the 200G bandwidth.

The PCIe mapping might be overwhelming if you’re only used to notebook/desktop setups.

Each NIC is dual-port:

NIC 1 = 0000:01:00.[0–1]

NIC 2 = 0002:01:00.[0–1]
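You can confirm the two domains directly with lspci (a generic sketch; 15b3 is the Mellanox/NVIDIA networking vendor ID):

```shell
# -D prints the full PCIe address including the domain,
# -d 15b3: filters to Mellanox/NVIDIA networking devices.
# On DGX Spark this should list functions under both 0000:01:00.x and 0002:01:00.x.
lspci -D -d 15b3:
```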


Has anyone connected the two Sparks with two cables, and are there any pros/cons to that setup? My second cable just arrived from Naddod, and I’m debating whether there’s any value in setting up two cables.

Edit:
Following this rabbit hole to see if 400G is actually feasible with two cables.

Current max achieved with two connected cables: ~26 GB/s (208 Gbps)
Single cable: ~200 Gbps (25 GB/s) theoretical
So the two cables hit the single-cable theoretical max.

Testing what happens if I enable GPUDirect RDMA, aiming for a potential 2x performance gain.

I’ve got similar numbers (195 Gbps over two links; busbw in my case is slightly higher, ~25 GB/s):

#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
   536870912     134217728     float    none       0  21982.6   24.42   24.42       0  23507.8   22.84   22.84       0
  1073741824     268435456     float    none       0  43953.5   24.43   24.43       0  43945.0   24.43   24.43       0
  2147483648     536870912     float    none       0  87839.0   24.45   24.45       0  88737.6   24.20   24.20       0
  4294967296    1073741824     float    none       0   175583   24.46   24.46       0   176853   24.29   24.29       0
Note that GPU utilization during the collective is 100% (well, 96%, with 4% idle), so maybe this is the maximum we can get?
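For anyone wanting to reproduce a table like the one above, this is roughly the nccl-tests invocation that sweeps the same message sizes (the binary path and choice of test are assumptions; use whichever test the Two Sparks playbook specifies):

```shell
# Sweep message sizes from 512 MiB to 4 GiB, doubling each step (-f 2),
# with one GPU per process (-g 1). Run under mpirun across both Sparks
# per the playbook; all_reduce_perf is just one example nccl-tests binary.
./build/all_reduce_perf -b 512M -e 4G -f 2 -g 1
```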

Can you clarify the 26 GB/s? In your screenshot it’s 12.89 GB/s.


Actually, you are getting 100 Gbps using both cables; you need to look at busbw. To get 200 Gbps with two cables, you would likely need to create a bond first.

Despite your optimistic LLM telling you what you want to hear, you won’t be able to achieve 400G using two ports: when both ports are active, each gets 100G. DGX Spark doesn’t have enough PCIe lanes to achieve 400G, and even to achieve 200G on a single port, they had to do some voodoo with the dual-NIC bonding due to the PCIe limitations of this architecture.

Oh, and GPUDirect RDMA is not implemented on Spark, according to NVIDIA; there were some posts from NVIDIA explaining why, and I believe it is buried somewhere in the documentation too.

I guess my point is that DGX Spark is a very new platform, so LLMs will not know much about it, and the risk of hallucinations is very high.