Been banging my head against this for a while and wanted to see if anyone else has run into it or if this is just a known limitation I missed somewhere.
I’m connecting my Spark to my UniFi UDM Pro Max via a QSFP-to-SFP+ adapter. Link comes up fine at 10 Gbps, everything looks healthy, but downloads cap out around 2.5 Gbps. Uploads hit 5+ Gbps no problem. My Mac Mini on the same switch gets 4.7 Gbps down, so it’s not the switch or my ISP.
Running DGX OS 7.3.1, kernel 6.14.0-1015-nvidia, firmware 28.45.4028.
I’ve thrown a lot at this:
∙ Cranked up TCP buffers (64 MB max for rmem/wmem)
∙ Bumped ring buffers to 4096
∙ Enabled hardware GRO, switched to BBR
∙ RSS looks fine: 20 RX queues active, IRQs spread out, CPU barely breaks a sweat
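For reference, the tuning above boils down to roughly this (interface name is from my box; adjust to yours):

```shell
# TCP buffers up to 64 MB, plus BBR congestion control
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 262144 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 262144 67108864"
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Ring buffers to 4096 and hardware GRO on the ConnectX-7 interface
sudo ethtool -G enp1s0f1np1 rx 4096 tx 4096
sudo ethtool -K enp1s0f1np1 rx-gro-hw on
```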
Also tried bonding both logical interfaces (enp1s0f1np1 + enP2p1s0f1np1) with balance-xor since I know the QSFP port is split across two PCIe x4 links internally. Both slaves show active but single-flow speeds didn’t budge.
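The bond setup was roughly this (a sketch with iproute2; interface names are mine). One caveat I'm aware of: the xmit hash only governs the transmit side, receive distribution is up to the peer, and a single flow always hashes to one slave anyway.

```shell
# balance-xor bond over the two logical interfaces,
# hashing on layer3+4 so different flows can use different slaves
sudo ip link add bond0 type bond mode balance-xor xmit_hash_policy layer3+4
sudo ip link set enp1s0f1np1 down
sudo ip link set enP2p1s0f1np1 down
sudo ip link set enp1s0f1np1 master bond0
sudo ip link set enP2p1s0f1np1 master bond0
sudo ip link set bond0 up
sudo dhclient bond0   # or however you normally get an address
```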
A few things I noticed that might be relevant:
2.5 Gbps is suspiciously close to what you’d get from one PCIe Gen5 x4 link after overhead. Makes me wonder if the RX path is only hitting one of the two links even with bonding.
When I run mlnx_tune it reports “Speed SDR”, which seems odd for an Ethernet interface. Not sure if that’s just a reporting quirk or actually meaningful.
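In case anyone wants to compare, here’s how I’m checking what the link and the PCIe side actually negotiated (the bus address below is a placeholder; find the real one first):

```shell
# What the NIC negotiated on the wire
ethtool enp1s0f1np1 | grep -E 'Speed|Duplex'

# Locate the ConnectX-7 on the PCIe bus, then check the trained
# link speed/width (replace 01:00.0 with the address you find)
lspci | grep -i mellanox
sudo lspci -s 01:00.0 -vv | grep -E 'LnkCap|LnkSta'
```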
The fact that uploads are fine at 5 Gbps tells me the hardware can do it…something’s just off with the receive side specifically.
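For what it’s worth, the numbers above come from iperf3 runs like these (server IP is a placeholder). Comparing one stream against several in the receive direction should show whether it’s a per-flow ceiling or a hard path limit:

```shell
# Single TCP stream, receive direction (-R pulls data toward the Spark)
iperf3 -c 192.168.1.10 -R -t 30

# 8 parallel streams; if aggregate RX climbs well past 2.5 Gbps,
# the limit is per-flow rather than the whole RX path
iperf3 -c 192.168.1.10 -R -P 8 -t 30
```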
So I guess my questions are:
- Is this ~2.5 Gbps RX limit just how it works when you’re going through a QSA adapter? Like is this a known thing with how the ConnectX-7 is wired up in the Spark?
- Any driver tweaks or firmware settings worth trying?
- What’s up with mlnx_tune showing SDR speed on an Ethernet link?
I get that the ConnectX-7 is really meant for connecting two Sparks together at 200G, not for running degraded through an adapter. The reason I’m not just using the Realtek 10G port is that the ConnectX-7 offloads a lot of the network processing to hardware…checksums, segmentation, etc. On a thermally constrained system that’s also running inference workloads, keeping that stuff off the CPU actually matters: less interrupt overhead, lower thermals, more cycles for actual work. So if there’s a way to make this work properly I’d rather stick with the QSFP path. I’m already using the other QSFP port for clustering with my second Spark, so I figured I’d put this one to use too.
Just want to know if I’m chasing something that’s actually fixable or if this is just the architecture doing its thing.
Appreciate any insight. Happy to provide more details if it helps.