DGX Spark direct QSFP connection only getting ~13-16 Gbps instead of expected 200G performance

ugurbas · May 13, 2026, 2:05pm

Hi,

I connected two DGX Spark nodes directly using a QSFP cable (Amphenol njaakk-n911) and followed the NVIDIA NCCL / RoCE setup guide.

Configuration:

Direct QSFP connection between the Sparks
Interfaces:
- Node1: enp1s0f1np1 → 169.254.246.117
- Node2: enp1s0f1np1 → 169.254.224.160
MTU tested with 1500 and 9000
Jumbo ping works
ethtool shows:
- Speed: 200000Mb/s
- Link detected: yes
PCIe link:
- 32GT/s x4

NCCL works and uses IB/RoCE:

NCCL INFO Using network IB
NCCL INFO NET/IB

However, performance is very low:

NCCL all_gather_perf:
- Avg bus bandwidth: ~2.8 GB/s
iperf3:
- ~13-16 Gbps
ib_write_bw:
- ~12.7-13.5 Gbps

I also tested:

larger buffers
multiple QPs
second P2P-visible interface
separate /24 addressing
MTU 9000

Results remain around ~13 Gbps.

There are no CRC errors, and RDMA counters increase normally.

Is this expected on DGX Spark / GB10, or should I be seeing much higher throughput (~90-100 Gbps+) like other reported Spark tests?

Could this be related to:

CX-7 multi-host mode,
wrong PF/interface selection,
cable compatibility,
firmware/driver issue,
or some missing RoCE configuration?

Thanks.

giles8 · May 13, 2026, 2:14pm

I can’t directly help you yet (I’m still waiting on a second GB10 to be delivered), but I notice that you mixed GBps and Gbps in your post, that is, gigabytes per second and gigabits per second (1 GBps = 8 Gbps). It might be best to stick to just one. I think your maximum expectation should probably be about 180 Gbps, or 22.5 GBps, in an iperf3 test with multiple concurrent UDP streams.
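
To make the unit difference concrete, here is a small shell sketch that converts between the two, using the 2.8 GB/s NCCL figure and the ~180 Gbps expectation from this thread. The function names are illustrative only.

```shell
# Convert gigabytes per second to gigabits per second (1 byte = 8 bits).
gbytes_to_gbits() {
  awk -v v="$1" 'BEGIN { printf "%.1f\n", v * 8 }'
}

# Convert gigabits per second to gigabytes per second.
gbits_to_gbytes() {
  awk -v v="$1" 'BEGIN { printf "%.1f\n", v / 8 }'
}

gbytes_to_gbits 2.8   # NCCL bus bandwidth from the first post, in Gbps
gbits_to_gbytes 180   # the ~180 Gbps expectation, in GB/s
```

By this conversion the 2.8 GB/s NCCL figure is about 22.4 Gbps. NCCL bus bandwidth is computed differently from iperf3 throughput, so the two numbers are only roughly comparable.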

azampatti · May 13, 2026, 4:02pm

What are you testing with? Maybe a dumb question from me, but if you’re testing file transfers you’ll be limited by the NVMe drive, which might top out around that speed.

You need to test memory-to-memory reads :)

raphael.amorim · May 13, 2026, 4:36pm

Hello,

Make sure both your nodes are fully updated:

sudo apt update
sudo apt dist-upgrade
sudo fwupdmgr refresh
sudo fwupdmgr upgrade
sudo reboot

I’m assuming you really have a DGX Spark, but if it’s another vendor’s GB10:

sudo fwupdmgr enable-remote lvfs-testing
sudo fwupdmgr refresh --force
sudo fwupdmgr update

In addition to that:

sudo shutdown -h now
# unplug USB-C power from the back of both Sparks and unplug bricks from the outlet
# wait ~5 minutes
# plug back in and boot both

Make sure you connect ConnectX-7 cage1 to cage1, or cage2 to cage2, on both GB10 devices; don’t use different ports on the two devices. If it still isn’t working after all of this and you’re still confused, use sparkrun to configure the network for you.

other thread references:

sjug · May 13, 2026, 5:51pm

Is there some way to fix them and make the “Detected insufficient power on the PCIe slot” errors go away?

raphael.amorim · May 13, 2026, 7:46pm

It doesn’t appear to affect anything in practice.

For example, if you inspect one of the ConnectX ports, you’ll notice that the reported PCIe slot power limit is 0W.

From lspci -vvv:

DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W

That likely triggers the driver warning because a standard full x16 PCIe slot is expected to provide up to 75W, and the driver may assume that behavior. In this case, though, the port is only connected as x4, and lspci reports the slot power limit as 0W rather than a real value. So the warning seems to be caused by missing or unusual platform-reported power metadata rather than an actual functional power issue.

giles8 · May 13, 2026, 7:49pm

Yes, the insufficient power message has been brought up in various threads. It is apparently spurious and benign, and can be ignored.

ugurbas · May 14, 2026, 12:58pm

Hi all,

Thank you for your interest in this topic.

While searching the forum, I found discussions mentioning compatibility issues between kernel 6.17.0 and ConnectX-7 firmware. NVIDIA had apparently released a firmware update to address this.

When I checked my DGX Spark systems, one node was already running the firmware version mentioned in those discussions, while another node had an even newer version. I upgraded all systems to the latest firmware version available to me (1.108.20).

After the upgrade, the two Sparks were finally able to communicate at around 100 Gbps according to iperf:

[SUM]   0.00-30.00  sec   388 GBytes   111 Gbits/sec    0             sender
[SUM]   0.00-30.01  sec   388 GBytes   111 Gbits/sec                  receiver
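
For context, a rough ceiling for the 32GT/s x4 PCIe link reported in the first post can be worked out as below. This is only a sketch: it assumes a single interface is fed by one such x4 link, and it accounts for 128b/130b line encoding but not for packet overhead. pcie_gbps is an illustrative helper name.

```shell
# Approximate usable rate of a PCIe link in Gbps after 128b/130b encoding.
# $1 = transfer rate per lane in GT/s, $2 = number of lanes
pcie_gbps() {
  awk -v gt="$1" -v lanes="$2" 'BEGIN { printf "%.1f\n", gt * lanes * 128 / 130 }'
}

pcie_gbps 32 4   # the 32GT/s x4 link from the first post
```

Under those assumptions the link tops out at about 126 Gbps, so 111 Gbits/sec on one interface is roughly 88% of that ceiling.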

I then expanded the setup to a 3-node ring topology following the official documentation:

1. Node1 (Port0) -> Node2 (Port1)
2. Node2 (Port0) -> Node3 (Port1)
3. Node3 (Port0) -> Node1 (Port1)
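
The cabling pattern above generalizes to any ring size: each node's Port0 goes to the next node's Port1, and the last node wraps around to the first. A minimal sketch that prints the expected links, where ring_links is a hypothetical helper and not part of any NVIDIA tooling:

```shell
# Print the cabling for an N-node ring: each node's Port0 connects
# to the next node's Port1, and the last node wraps back to Node1.
ring_links() {
  n="$1"
  i=1
  while [ "$i" -le "$n" ]; do
    next=$(( i % n + 1 ))
    echo "Node${i} (Port0) -> Node${next} (Port1)"
    i=$(( i + 1 ))
  done
}

ring_links 3
```

For three nodes this prints the same three links as the list above, which makes it easy to compare against the actual cabling.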

Strangely, after creating the ring topology, iperf bandwidth dropped again to around 10Gbps even though all firmware and software versions were up to date.

I checked dmesg and found these messages:

Detected insufficient power on the PCIe slot (27W)
AER: Multiple Uncorrectable (Fatal) error

I understand that the “insufficient power” warning is expected/known on these systems, but the fatal AER error looked concerning.
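
To separate the two kinds of message when reading the kernel log, a filter along these lines may help. It is a sketch that assumes the AER lines look like the one quoted above. fatal_aer_lines is an illustrative name.

```shell
# Read kernel log text on stdin and print only the AER lines
# that report uncorrectable fatal errors.
fatal_aer_lines() {
  grep -E 'AER: .*Uncorrectable \(Fatal\)'
}

# Typical use on a live system (dmesg may need root):
#   sudo dmesg | fatal_aer_lines
```

No output means no matching fatal AER lines were found. The "insufficient power" warnings are deliberately not matched.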

What eventually fixed the issue was:

shutting down all Sparks,
unplugging power and QSFP cables,
waiting a few minutes,
pressing the power button while unplugged to discharge residual power,
reconnecting everything and booting again.

After doing this, iperf returned to ~100Gbps and the fatal AER errors disappeared. The “insufficient power” warnings are still present, but performance has remained stable for about 7 hours now.

At this point I am not fully sure what originally caused the 10Gbps behavior after enabling the ring topology, but for now the cluster appears stable and operating correctly.

I hope this information helps someone else facing a similar issue.

robert287 · May 14, 2026, 2:06pm

The 27W message is a normal warning: until hot-plug detects the cables, the system keeps the ports in a low-power state.

Unplugging and holding the power button for a full drain clears a different issue: a CPU low-power state that can occur if your system runs into OOM crashes. And of course, if the CPU is running in low-power mode, it can’t feed data to the ConnectX-7 port.

elsaco · May 14, 2026, 3:38pm

On DGX Spark there are no PCIe slots. All devices are connected directly to the SoC. If you run lspci -vv and grep for SlotPowerLimit all are 0W, meaning there’s no physical PCI slot and not a power limit issue. I think the firmware sets SlotPowerLimit=0 causing the driver to report an insufficient power detection.
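
That check can be scripted as below. It is a sketch that reads lspci -vv text on standard input, so it can also be tried on saved output. count_zero_slot_power is an illustrative name, and accepting a 0.000W spelling is an assumption about how other lspci versions may print the value.

```shell
# Count capability lines that report a zero slot power limit.
# Reads `lspci -vv` output on stdin.
count_zero_slot_power() {
  grep -cE 'SlotPowerLimit 0(\.0+)?W'
}

# Typical use on a live system (root is needed to read full capabilities):
#   sudo lspci -vv | count_zero_slot_power
```

Comparing this count against the total number of SlotPowerLimit lines would show whether every entry is 0W.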

robert287 · May 14, 2026, 3:51pm

While you are right that there aren’t slots, there are PCIe lanes to the ConnectX-7 ports, and the firmware absolutely does apply power limits on them. The hotplug fix exists to keep the ConnectX-7 ports in a low-power state when they have no active connections.
