Just another ASUS GX10 NCCL all_gather_perf thread... mpirun... please read if you have an ASUS model multinode setup

TL;DR - Is there a single person on these threads running the Ascent GX10 that is currently able to get 20+ GB/s mpirun all_gather_perf results? I saw a thread where someone said downgrading to 0x03000004 resolved this for them, but I came from 0x03000004 (currently 0x03000005) and the issue was persistent for me on 0x03000004 as well before I upgraded.

What was your precise downgrade method? I would love to test the previous version properly, but I don’t see a supported downgrade path and with ASUS’ customer service record over the last couple years I’m scared to brick any of these.

———-

I’ve been experiencing the same quirk with my ASUS DGX Sparks for over a year, but I also had a long hiatus without benchmarking mpirun. Therefore, it’s hard to say if my issue was always firmware or something else in my setup. My results line up with other threads I’ve seen from justifiably whiny ASUS Spark victims.

Currently I can pull only 14 to 17.2 GB/s with mpirun with a 2 GB buffer set. This was true before and after the latest ASUS firmware, which I installed both through fwupdmgr and with the manual update_capsule.sh method.

I get the same results whether I connect 4 nodes via a switch or 2 nodes via direct attach, so I have ruled out misconfigured flow control, MTU settings, and the like.

Before I bore you with dozens of pages of logs and the dozens of hours I’ve spent on this, I wanted to craft this introductory post.

– NCCL Library and nccl-test version - have tried the newest version and just about every library over the last year
– MPIRUN - version 4.1.6
– If I don’t set a valid routable IP on the secondary logical interface of each NIC pair, I get only 8 GB/s. This is true even if they have IPv4 link-local addresses, which isn’t surprising as these are non-routable.
– I had Codex write a script to try every possible configuration of NCCL variables and let it run through thousands of test possibilities just to confirm that I wasn’t crazy and all runs settled between the 8 GB/s to 17 GB/s range.

BIOS: GX10DGX.0103.2026.0129.1152 – PD0 FW1: 5.7, FW2: 5.7

  • UEFI: ASUS_UEFI_0103

  • EC firmware: 2.78.18

  • UEFI capsule device version: 0x03000005

  • ASUS PD capsule device version: 0x00000507

  • Mellanox firmware: 28.45.4028

ethtool -i enp1s0f1np1
driver: mlx5_core
version: 6.17.0-1014-nvidia
firmware-version: 28.45.4028 (NVD0000000087)
expansion-rom-version:
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Lastly, I see so many mpirun support threads, but they almost never include their entire NCCL export set. It would be great if some users who have high-performance results could post not only their mpirun results, but also any possibly related exports that already exist in their shell, not just those specified just prior to the run. Many users misunderstand how mpirun actually enforces settings on the worker node. I’m not going to get into it in this initial post, but if there are some quality responses here we can go down that path. :)
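For anyone who wants to post their full export set as requested above, here is a minimal sketch of how it could be collected. The function name and the prefix list are assumptions for illustration, not something from this thread. As far as I know, Open MPI's mpirun only forwards a variable to remote ranks when it is passed with -x (or is an OMPI_* variable), so what the launching shell has exported and what a worker actually sees can differ.

```shell
#!/usr/bin/env bash
# Print every NCCL-related variable exported in the current shell, sorted,
# so the whole set can be pasted into a post.
# The prefix list is an assumption; extend it if your setup uses other knobs.
dump_nccl_env() {
  env | grep -E '^(NCCL_|UCX_|OMPI_MCA_)' | sort
}

# Usage on the launching node:
#   dump_nccl_env
# Usage against a worker ("worker1" is a hypothetical host name). Note that a
# non-interactive ssh shell may not read the same startup files as your login shell:
#   ssh worker1 'env | grep ^NCCL_ | sort'
```

Comparing the two outputs is a quick way to see which settings only exist on the head node.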

I had the exact same experience with my ASUS Ascent GX10 + Gigabyte AI TOP ATOM combo. Started at 13 GB/s with all_gather_perf, eventually got to a stable 20.21 GB/s busbw after a series of fixes. Sharing the full path in case it helps you debug.

Firmware path

I run BIOS GX10DGX.0103.2026.0129.1152 and Mellanox 28.45.4028 (NVD0000000087) — same as you. I did not find downgrading necessary; the issue turned out to be NCCL config + dual-HCA topology, not firmware itself.

That said, before benchmarking, make sure you’ve pulled the absolute latest from the LVFS testing channel:

sudo fwupdmgr enable-remote lvfs-testing
sudo fwupdmgr refresh --force
sudo fwupdmgr update

Worth checking even if you think you’re current.

The dual-HCA discovery

This is the most important thing in my journey. Each Spark’s QSFP56 200G port physically routes to both ConnectX-7 HCAs simultaneously (PCIe Gen5 x4 to each). If you only assign IPs/configure one HCA, you’re capping yourself at ~13-17 GB/s. Confirmed by users sajid0405103 and itstexmex on related threads.

Check which interfaces you have active:

ibdev2netdev
ip -br addr show | grep enp

You should see both rocep1s0f0/enp1s0f0np0 AND roceP2p1s0f0/enP2p1s0f0np0 with valid IPs on different subnets, both with MTU 9000.

My setup (saved persistently in netplan):

  • enP2p1s0f0np0 = 10.20.20.X/24 MTU 9000

  • enp1s0f0np0 = 10.20.30.X/24 MTU 9000
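The "both interfaces with valid IPs" check can be scripted. This is a rough sketch, assuming the usual three-column layout of `ip -br -4 addr show` (name, state, addresses); `check_ifaces` is a hypothetical helper name. It flags any listed interface that is missing, not UP, or has only a 169.254.x.x link-local address.

```shell
#!/usr/bin/env bash
# Reads `ip -br -4 addr show` style lines on stdin and checks that every
# interface named on the command line is UP with a non-link-local IPv4 address.
# Exits non-zero if any interface fails the check.
check_ifaces() {
  awk -v want="$*" '
    BEGIN {
      n = split(want, names, " ")
      for (i = 1; i <= n; i++) need[names[i]] = 1
    }
    ($1 in need) {
      ok = 0
      # Fields 3..NF are the addresses in CIDR form.
      for (f = 3; f <= NF; f++)
        if ($f ~ /^[0-9]+\./ && $f !~ /^169\.254\./) ok = 1
      if ($2 == "UP" && ok) { print $1 ": ok"; delete need[$1] }
    }
    END {
      bad = 0
      for (name in need) { print name ": missing, down, or link-local only"; bad = 1 }
      exit bad
    }'
}

# Usage:
#   ip -br -4 addr show | check_ifaces enp1s0f0np0 enP2p1s0f0np0
```

This does not look at MTU; reading /sys/class/net/<iface>/mtu for each interface covers that part.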

NCCL config that gave me 20.21 GB/s

After ~500 NCCL config combinations tested, this is the sweet spot for cluster inference workloads (vLLM, Ray):

NCCL_NET_PLUGIN=none
NCCL_SOCKET_IFNAME=enP2p1s0f0np0,enp1s0f0np0
NCCL_IB_HCA=roceP2p1s0f0,rocep1s0f0
NCCL_IB_GID_INDEX=3
NCCL_IB_MTU=5
NCCL_IB_PCI_RELAXED_ORDERING=1
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_IB_SPLIT_DATA_ON_QPS=1
NCCL_MIN_NCHANNELS=32
NCCL_IB_MERGE_NICS=1
NCCL_IGNORE_CPU_AFFINITY=1
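Since variables only reach the worker ranks when they are passed explicitly, one option is to keep the list above in a file and turn it into -x flags at launch time. A minimal sketch, assuming a hypothetical file `nccl.env` that holds the KEY=VALUE lines exactly as written above; `nccl_flags_from_file` and `NCCL_FLAGS` are made-up names:

```shell
#!/usr/bin/env bash
# Turn KEY=VALUE lines into "-x KEY=VALUE" arguments for Open MPI's mpirun.
# Blank lines and lines starting with # are skipped.
# The result is stored in the global array NCCL_FLAGS.
nccl_flags_from_file() {
  local line
  NCCL_FLAGS=()
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    NCCL_FLAGS+=(-x "$line")
  done < "$1"
}

# Usage (host names and binary path are placeholders):
#   nccl_flags_from_file ./nccl.env
#   mpirun -np 2 -H hostA:1,hostB:1 "${NCCL_FLAGS[@]}" ./all_gather_perf -b 1G -e 4G -f 2
```

Keeping one file per experiment also makes it easy to post the exact set that produced a given number.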

Key insights from my testing:

  1. NCCL_IB_MERGE_NICS=1 is critical when both HCAs are configured; without it NCCL doesn’t fully utilize the second card. On its own it gives no benefit, but combined with channel tuning it unlocks ~5 GB/s.

  2. NCCL_MIN_NCHANNELS=32 beats 64 for me. 64 channels gave 19.36 GB/s on all_gather_perf, 32 channels gave 21.19 GB/s peak. More channels = more setup overhead per chunk on the dual-HCA path. YMMV, worth testing both with your specific buffer sizes.

  3. NCCL_IB_QPS_PER_CONNECTION=8 + SPLIT_DATA_ON_QPS=1 matter. Default of 4 caps at ~17 GB/s.

  4. NCCL_NET_PLUGIN=none — disable the auto-loaded plugin, it interferes with the explicit HCA setup.
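If you want to repeat the 32-vs-64 and QPS comparisons on your own hardware, a small sweep is enough. This is only a sketch: the value lists are illustrative assumptions, `sweep_flags` is a made-up helper, and the mpirun line in the usage comment stands in for whatever command already works for you.

```shell
#!/usr/bin/env bash
# Print one line of "-x" flags per combination of the two knobs discussed above.
# The value lists are examples to sweep over, not recommendations.
sweep_flags() {
  local nch qps
  for nch in 16 32 64; do
    for qps in 4 8; do
      echo "-x NCCL_MIN_NCHANNELS=$nch -x NCCL_IB_QPS_PER_CONNECTION=$qps"
    done
  done
}

# Usage: splice each line into your own working mpirun command.
# mpirun reads stdin, so redirect it from /dev/null or it will swallow
# the remaining lines of the loop input.
#   sweep_flags | while read -r flags; do
#     echo "== $flags"
#     mpirun <your usual arguments> $flags <your all_gather_perf command> </dev/null \
#       | grep 'Avg bus bandwidth'
#   done
```

Running each combination two or three times helps separate real differences from run-to-run noise.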

Realistic ceiling on ASUS GX10

Real DGX Spark FE users report 22+ GB/s. With ASUS in the mix (mine is ASUS + Gigabyte combo), 20-21 GB/s seems to be the realistic ceiling — about 90-95% of FE performance.

My benchmark command for reference

mpirun -np 2 -H 10.20.20.1:1,10.20.20.2:1 \
  --allow-run-as-root \
  --mca plm_rsh_agent "ssh -x -o ForwardX11=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  --mca oob_tcp_if_include enP2p1s0f0np0 \
  --mca btl_tcp_if_include enP2p1s0f0np0 \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  -x NCCL_SOCKET_IFNAME=enP2p1s0f0np0,enp1s0f0np0 \
  -x NCCL_IB_HCA=roceP2p1s0f0,rocep1s0f0 \
  -x NCCL_NET_PLUGIN=none \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_IB_MTU=5 \
  -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
  -x NCCL_IB_QPS_PER_CONNECTION=8 \
  -x NCCL_IB_SPLIT_DATA_ON_QPS=1 \
  -x NCCL_MIN_NCHANNELS=32 \
  -x NCCL_IB_MERGE_NICS=1 \
  -x NCCL_DEBUG=WARN \
  $HOME/nccl-tests/build/all_gather_perf -b 1G -e 4G -f 2

NCCL version 2.29.7+cuda13.2 from the NGC container.

Hope this saves you some hours. If you try the dual-HCA + MERGE_NICS=1 + MIN_NCHANNELS=32 combo and still cap at 17 GB/s, post your ibdev2netdev and ip addr output — happy to compare topologies.

Wow, thank you so much for taking the time to write this. You solved my longstanding issue.

For me, the only thing I needed was the lvfs-testing force! Which is interesting because all of my firmware was reporting the versions you already saw, but when I ran it I did get these updates:

Upgrade Embedded Controller from 0x02000004 to 0x02000005? Y
Upgrade UEFI Device Firmware from 0x03000005 to 0x03000006? Y

So, yes, this is even newer than the March 18th firmware (the newest) on the ASUS site. Immediately upon reboot, mpirun was finally fast for the first time ever:

Avg bus bandwidth : 21.267

Note: It looks like having the secondary logical interface on an explicitly separate subnet is not required. Mine is on the same subnet (e.g., 10.10.10.2 + 10.10.10.3), unlike your setup.

Hope you have a good day, thank you once again!

I don’t know if those switches give us real improvements or if they give us the illusion… I’m getting strange numbers :D

#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   356107   48.24   24.12       0   353493   48.60   24.30       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.2109
#
# Collective test concluded: all_gather_perf
#
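One way to sanity-check a row like this is to recompute the two bandwidth columns from the size and time. The sketch below assumes the nccl-tests convention for all_gather, busbw = algbw * (n-1)/n, and assumes two ranks, which is what the 0.5 ratio between the columns implies; `allgather_bw` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Recompute algbw and busbw (GB/s) for one all_gather_perf row.
# usage: allgather_bw SIZE_BYTES TIME_US NRANKS
allgather_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000       # bytes per microsecond -> GB/s (1e9 bytes)
    busbw = algbw * (n - 1) / n    # all_gather correction factor
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

allgather_bw 17179869184 356107 2   # prints "48.24 24.12"
```

The out-of-place columns of the row above come out the same way, so the table is at least internally consistent; whether the flags are the reason for the higher figure is a separate question.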

Following on from karol’s dual-hca config — I packaged it into a repo so the next person doesn’t have to forum-scrape for two days like I did (more like 2 full days spread across 2 weeks).

sm_121 nccl rebuild + NCCL_IB_MERGE_NICS=1 + a networkmanager profile for the second-half ip (which the os won’t set up for you). measured 11.5 → 24.1 gbps allreduce on a pair of asus gx10.

My readme covers the GDR rabbit-hole, too (short version: it’s architectural on gb10, stop chasing it).