TL;DR - Is there a single person on these threads running the Ascent GX10 that is currently able to get 20+ GB/s mpirun all_gather_perf results? I saw a thread where someone said downgrading to 0x03000004 resolved this for them, but I came from 0x03000004 (currently 0x03000005) and the issue was persistent for me on 0x03000004 as well before I upgraded.
What was your precise downgrade method? I would love to test the previous version properly, but I don’t see a supported downgrade path and with ASUS’ customer service record over the last couple years I’m scared to brick any of these.
———-
I’ve been experiencing the same quirk with my ASUS DGX Sparks for over a year, but I also had a long hiatus without benchmarking mpirun. Therefore, it’s hard to say if my issue was always firmware or something else in my setup. My results line up with other threads I’ve seen from justifiably whiny ASUS Spark victims.
Currently I can pull only 14 to 17.2 GB/s with mpirun w/ 2GB buffer set. This was true before and after the latest ASUS firmware. I installed the firmware both through fwupdmgr and with the manual update_capsule.sh method.
I get the same results whether I connect 4 nodes via switch or 2 nodes via direct attach, so I have ruled out mis-shaped flow control, MTU settings and the like.
Before I bore you with dozens of pages of logs and the dozens of hours I’ve spent on this, I wanted to craft this introductory post.
– NCCL Library and nccl-test version - have tried the newest version and just about every library over the last year
– MPIRUN - version 4.1.6
– If I don’t set a valid routable IP on the secondary pair of NIC logical interfaces, I get only 8 GB/s. This is true even if they have IPv4 link-local addresses, which isn’t surprising as these are non-routable.
– I had Codex write a script to try every possible configuration of NCCL variables and let it run through thousands of test possibilities just to confirm that I wasn’t crazy and all runs settled between the 8 GB/s to 17 GB/s range.
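For anyone who wants to reproduce that kind of sweep, here is a minimal sketch of the idea, not the script Codex actually wrote. It assumes Open MPI's `-x NAME=value` syntax for passing variables to the ranks and the standard nccl-tests size flags. The variable names are real NCCL settings, but the value lists, the host names, the binary path and the `-e 2G` maximum message size are placeholders I picked for illustration. It only builds the command lines; running them and parsing the bandwidth column is left out.

```python
import itertools
import shlex

# Hypothetical search space: real NCCL variable names, but these value lists
# are illustrative and are not the set that was actually swept.
SEARCH_SPACE = {
    "NCCL_ALGO": ["Ring", "Tree"],
    "NCCL_PROTO": ["Simple", "LL", "LL128"],
    "NCCL_IB_DISABLE": ["0", "1"],
}


def build_commands(hosts, space=SEARCH_SPACE, binary="./build/all_gather_perf"):
    """Return one mpirun argument list per combination of NCCL settings."""
    names = sorted(space)
    commands = []
    # itertools.product yields every combination of the candidate values.
    for values in itertools.product(*(space[n] for n in names)):
        # One rank per host; host names are placeholders.
        cmd = ["mpirun", "-np", str(len(hosts)), "-H", ",".join(hosts)]
        for name, value in zip(names, values):
            # Pass each setting explicitly so the worker ranks see it too.
            cmd += ["-x", f"{name}={value}"]
        # Sweep message sizes from 8 bytes to 2G, doubling, one GPU per rank.
        cmd += [binary, "-b", "8", "-e", "2G", "-f", "2", "-g", "1"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands(["node1", "node2"]):
        print(shlex.join(cmd))
```

With three variables at 2, 3 and 2 candidate values this produces 12 command lines; adding more variables multiplies the count quickly, which is how a sweep ends up in the thousands.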
– BIOS: GX10DGX.0103.2026.0129.1152 – PD0 FW1: 5.7, FW2: 5.7
– UEFI: ASUS_UEFI_0103
– EC firmware: 2.78.18
– UEFI capsule device version: 0x03000005
– ASUS PD capsule device version: 0x00000507
– Mellanox firmware: 28.45.4028
ethtool -i enp1s0f1np1
driver: mlx5_core
version: 6.17.0-1014-nvidia
firmware-version: 28.45.4028 (NVD0000000087)
expansion-rom-version:
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Lastly, I see so many mpirun support threads, but they almost never include their entire NCCL export set. It would be great if some users who have high performance results could post not only their mpirun results, but any possibly related exports that already exist in their shell, not just those specified just prior to the run. Many users misunderstand how mpirun actually enforces settings on the worker nodes. I’m not going to get into it in this initial post, but if there are some quality responses here we can go down that path. :)
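To make posting the full export set easier, here is a small sketch that dumps every variable in the current shell whose name starts with a prefix that might matter. The prefix list (`NCCL_`, `UCX_`, `OMPI_`) is my assumption about what is relevant, and the function names are made up for this post. It also prints the matching `-x NAME` flags, on the assumption (from my reading of the Open MPI docs) that only `OMPI_*` variables are forwarded to remote ranks automatically and everything else has to be passed with `-x`.

```python
import os

# Prefixes assumed to be relevant for NCCL-over-mpirun runs; adjust as needed.
PREFIXES = ("NCCL_", "UCX_", "OMPI_")


def relevant_exports(env=None, prefixes=PREFIXES):
    """Return the matching environment variables, sorted by name."""
    env = os.environ if env is None else env
    return {k: v for k, v in sorted(env.items()) if k.startswith(prefixes)}


def as_mpirun_flags(exports):
    """Build '-x NAME' flags for variables mpirun will not forward by itself."""
    flags = []
    for name in exports:
        # Assumption: OMPI_* variables are already forwarded by Open MPI.
        if name.startswith("OMPI_"):
            continue
        flags += ["-x", name]
    return flags


if __name__ == "__main__":
    found = relevant_exports()
    for name, value in found.items():
        print(f"export {name}={value}")
    print(" ".join(as_mpirun_flags(found)))
```

Running it on the launch node and on each worker over ssh, then diffing the output, would show whether the two sides really agree on their settings.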