Confusion surrounding the QSFP ports and bandwidth

Hey folks,

I have two DGX Sparks and two QSFP56 cables. I thought I could get 400Gbps/50GBps bandwidth, using the two cables. I saw some conflicting and confusing information in different forum posts and I’m now questioning the basics. I’m pretty new to this, so any help is appreciated.

  1. Does the Spark support two QSFP56 connections simultaneously, or just the one?
  2. What sort of bandwidth should I be expecting? I’m currently seeing around 13GB/s.
  3. Does anyone have any tips on maximizing bandwidth potential? I was hoping to take advantage of tensor parallelism and am now concerned about the bandwidth.

I appreciate you and your time.

There are two ConnectX port cages on the Spark, but the 200Gbps total bandwidth is per device, not per port. My understanding is that if both ports are in use, the bandwidth is split and each link runs at 100Gbps. Use ethtool to check the current link speed.

I’m only using one cable to connect two Sparks back-to-back and the ethtool output is:

        Speed: 200000Mb/s
        Lanes: 4
        Duplex: Full
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal

I don’t have another QSFP56 cable to test my theory. If you’re using both links please post the Speed reported by ethtool for both ports.
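For anyone comparing ports, here is a minimal Python sketch that pulls the Speed and Lanes fields out of ethtool-style text and converts the speed to Gb/s and GB/s. It only parses text that has already been captured; the helper names (parse_ethtool, speed_gbps) are made up for illustration, and the sample is the output pasted above.

```python
import re

def parse_ethtool(text: str) -> dict:
    """Collect 'Key: value' pairs from ethtool-style output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

def speed_gbps(fields: dict):
    """Return the link speed in Gb/s, or None if it is not reported in Mb/s."""
    match = re.fullmatch(r"(\d+)Mb/s", fields.get("Speed", ""))
    return int(match.group(1)) / 1000 if match else None

# Sample copied from the ethtool output posted in this thread.
SAMPLE = """\
        Speed: 200000Mb/s
        Lanes: 4
        Duplex: Full
        Auto-negotiation: on
        Port: Direct Attach Copper
"""

fields = parse_ethtool(SAMPLE)
gbps = speed_gbps(fields)
# Dividing by 8 gives the raw line rate in GB/s, before any protocol overhead.
print(f"{gbps:.0f} Gb/s over {fields['Lanes']} lanes = {gbps / 8:.0f} GB/s raw")
```

Running the same parser on the output for each port makes it easy to see whether a link negotiated at the full speed or something lower.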

You won’t reach 400Gbps; the limit is ~200Gbps.
If you just follow the original playbooks with a single cable (connect port 1 to port 1 or port 2 to port 2) you should see 24-25 GBps:

Before anything else, test with just 1 NIC on each Spark and 1 QSFP cable (connect the same port on each Spark). Give both Sparks IPs in the same /24 and verify that ping works in both directions.
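As a quick sanity check on the addressing step, this small Python sketch confirms that two interface addresses sit in the same subnet and are not duplicates. The addresses are hypothetical placeholders, not values from the playbooks, and the check does not replace the ping test itself.

```python
import ipaddress

def same_subnet(addr_a: str, addr_b: str) -> bool:
    """True if both interface addresses share a network and differ in host IP."""
    a = ipaddress.ip_interface(addr_a)
    b = ipaddress.ip_interface(addr_b)
    return a.network == b.network and a.ip != b.ip

# Hypothetical addresses for the two Sparks; substitute your own.
SPARK_A = "192.168.100.10/24"
SPARK_B = "192.168.100.11/24"

if same_subnet(SPARK_A, SPARK_B):
    print("Addresses share a subnet; ping should work once the link is up.")
else:
    print("Addresses are on different subnets or are identical.")
```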

These instructions work fine for 2 cables, but a second cable isn’t really needed:

Please take a look at this thread if needed: ConnectX-7 NIC in DGX Spark


For the life of me, I can’t get the second port to recognize the cable today. I’ve spent about 2 hours rebooting and troubleshooting to test this, but I think I’m going to have to give up. Sorry.

Thanks for confirming the port speeds and that the Spark is limited to 200Gbps, regardless of cable config. I’ll try to undo what I’ve done and run through the tutorial again. I appreciate you and your time.

I put a benchmark here which you can use to make sure you’re running properly on 200GbE and not somehow falling back to the 10GbE connection.
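To interpret whatever number a benchmark reports, a rough Python sketch like the one below converts measured GB/s to Gb/s and compares it against a link's raw line rate. This is not the linked benchmark; the function names are illustrative, and the sample figures are the 13 GB/s and 24-25 GB/s values mentioned earlier in this thread.

```python
BITS_PER_BYTE = 8

def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert a throughput in GB/s to Gb/s."""
    return gbytes_per_s * BITS_PER_BYTE

def utilization(measured_gbytes_per_s: float, link_gbps: float) -> float:
    """Fraction of the link's raw line rate that the measurement represents."""
    return gbytes_to_gbits(measured_gbytes_per_s) / link_gbps

def exceeds_link(measured_gbytes_per_s: float, link_gbps: float) -> bool:
    """True if the measurement is faster than the given link could carry."""
    return gbytes_to_gbits(measured_gbytes_per_s) > link_gbps

for measured in (13.0, 24.5):  # GB/s figures quoted in this thread
    share = utilization(measured, 200)
    on_fast_link = exceeds_link(measured, 10)
    print(f"{measured} GB/s = {gbytes_to_gbits(measured):.0f} Gb/s, "
          f"{share:.0%} of 200GbE, faster than 10GbE: {on_fast_link}")
```

Any result above 1.25 GB/s cannot be coming over a 10GbE connection, so it at least rules out that fallback even when it is well short of 25 GB/s.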

Did I understand the linked articles correctly: with 2 Sparks, 1 cable between them is all you need to get full bandwidth?


For 2 Sparks you only need 1 cable.


Yeah, I was able to achieve the full bandwidth with 1 cable. I ended up leaving the second as a manual failover, since I’ve had my cards get stuck in a bad state a few times and it’s easier to change the config than to run through multiple reboots. Apparently the Spark is limited to 200Gbps/25GBps due to lane limitations; it’s a limit for the unit as a whole.

I ended up wiping the system and running through the playbooks again. I went a bit Wild West with my network configs the first time around because I was struggling to get a carrier on the DAC. Apparently I just needed to do a full shutdown and boot both units at the same time to get the carrier.

