GPUDirect RDMA Bandwidth Bottleneck (~38Gbps) on ASUS WS X299 SAGE/10G with Tesla T4 + BlueField-2

JOJOTOOL · December 19, 2025, 4:14pm

Hi everyone,

I am trying to implement GPUDirect RDMA (GDR) on a setup involving two identical machines, but I am hitting a hard bandwidth bottleneck of around 32-38 Gbps, despite the hardware theoretically supporting 100 Gbps.

Hardware Setup (Per Node):

Motherboard: ASUS WS X299 SAGE/10G
CPU: Intel(R) Core™ i9-10980XE CPU @ 3.00GHz
GPU: NVIDIA Tesla T4
NIC/DPU: NVIDIA BlueField-2 (configured as Ethernet/RoCE)
OS: Ubuntu 22.04 LTS
Driver: NVIDIA Driver 550 (Production Branch)
CUDA: 12.3
Mellanox OFED Driver: OFED-internal-25.07-0.9.7

The Issue: When running ib_write_bw with GPUDirect enabled, the bandwidth is capped at ~38 Gbps. However, checking individual components confirms they are capable of full speed:

PCIe Link Status: lspci shows that both the GPU and the DPU have negotiated Gen3 x16 (8 GT/s).
Pure NIC Performance: Standard RDMA test (ib_write_bw without CUDA) reaches ~96 Gbps, confirming the network link is healthy.
Host-to-Device Bandwidth: CUDA bandwidthTest (pinned memory) shows ~12 GB/s, confirming the GPU's PCIe link is operating at x16 speed.
Topology: nvidia-smi topo -m shows PIX, indicating both devices are under the same PCIe root complex/switch.

Troubleshooting Steps Taken:

Enabled Above 4G Decoding and disabled Secure Boot in BIOS.
Enabled ACS Override on the kernel command line in GRUB: pcie_acs_override=downstream,multifunction.
Set MTU to 9000 (Jumbo Frames) on both interfaces.
Verified that the issue persists in both Loopback and Node-to-Node tests.

Hypothesis: Given the ASUS X299 SAGE uses PLX switches to expand PCIe lanes, I suspect the bottleneck lies in the P2P routing path through the PLX chips or a DMI limitation (limiting throughput to ~PCIe Gen3 x4 speeds), even though the link width reports x16.
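For reference, the link figures quoted in this post can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, assuming Gen3 signalling at 8 GT/s per lane with 128b/130b encoding and ignoring packet (TLP/DLLP) overhead, so real-world throughput will be somewhat lower. The function names are made up for illustration.

```python
# Back-of-the-envelope PCIe Gen3 capacity per direction.
# Assumption: only 128b/130b line encoding is accounted for; TLP/DLLP
# protocol overhead is ignored, so achievable throughput is lower.
GEN3_GT_PER_S = 8.0          # Gen3 signalling rate per lane (GT/s)
GEN3_ENCODING = 128 / 130    # 128b/130b line encoding used by Gen3


def pcie_gen3_gbps(lanes: int) -> float:
    """Line-rate capacity of a Gen3 link with the given width, in Gbit/s."""
    return GEN3_GT_PER_S * GEN3_ENCODING * lanes


def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert GB/s to Gbit/s."""
    return gb_per_s * 8


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        gbps = pcie_gen3_gbps(lanes)
        print(f"Gen3 x{lanes:<2}: {gbps:6.1f} Gbps ({gbps / 8:5.2f} GB/s)")
    print(f"bandwidthTest 12 GB/s = {gbytes_to_gbits(12):.0f} Gbps")
```

Under these assumptions an x16 link tops out at roughly 126 Gbps, x8 at roughly 63 Gbps, and x4 at roughly 31.5 Gbps. The 12 GB/s bandwidthTest result (96 Gbps) is above what an x8 link could carry, which is consistent with the GPU link really running at x16.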

Has anyone experienced similar P2P performance issues on X299 platforms with PLX chips? Are there specific BIOS settings or slot configurations recommended for this motherboard to enable full P2P throughput (~100 Gbps)?

Thanks.

rs277 · December 19, 2025, 6:09pm

Just an observation that could be irrelevant:

I’m assuming “both” means the GPU and the NIC.

The ib_write_bw result shows GPU0 in use. The nvidia-smi topo output shows GPU0 connected via SYS to the NICs.

JOJOTOOL · December 19, 2025, 6:38pm

Hi,

Thank you so much for looking into this!

I completely understand why it looks like I might be using the wrong GPU—the device indexing between these tools can be quite confusing.

I noticed that ib_write_bw (CUDA-based) and nvidia-smi seem to sort the devices differently on my system:

In the ib_write_bw log: it selects Device 0, which it explicitly names [Tesla T4]. As far as I know, CUDA's default device order (CUDA_DEVICE_ORDER=FASTEST_FIRST) puts the GPU it judges fastest first rather than following the PCIe bus order.
In nvidia-smi: it lists the T4 as GPU 1 (based on the PCIe bus ID), while the Quadro K620 (display GPU) is GPU 0.

So, if we look at the topology matrix again, I think we should be focusing on the row for GPU 1 (the T4). It shows a PIX connection to NIC 0, which suggests they are under the same PCIe switch.
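One way to avoid this kind of index mismatch is to ask CUDA to enumerate GPUs in PCI bus order through the CUDA_DEVICE_ORDER environment variable, so that CUDA indices line up with the ones nvidia-smi prints. The sketch below only prepares such an environment for a child process; the helper name is hypothetical, and whether a particular benchmark build honours the variable should be checked on the system in question.

```python
import os


def env_with_pci_bus_order(base_env=None):
    """Return a copy of the environment asking CUDA to enumerate GPUs by PCI bus ID."""
    env = dict(os.environ if base_env is None else base_env)
    # PCI_BUS_ID makes CUDA device indices follow PCI bus order (as nvidia-smi does).
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    return env


if __name__ == "__main__":
    env = env_with_pci_bus_order()
    print(env["CUDA_DEVICE_ORDER"])
    # Pass env=env to subprocess.run(...) when launching the CUDA program.
```

With this set, the GPU index given to a CUDA program should refer to the same card that nvidia-smi lists under that number.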

Since the topology appears to be correct (PIX) and link speeds are fine, I’m puzzled by the ~38 Gbps cap. Do you think this specific number (which aligns closely with the DMI bandwidth) points to a routing issue within the PLX chips on this motherboard?

Thanks again for your help!

rs277 · December 19, 2025, 7:27pm

Sorry, I’ve no experience with your scenario, just the observation on the data as presented.

Looking at the manual for your motherboard (Appendix A1, 48-lane configuration), it seems PCIe slot 1 is the only true x16 one. All other slots only have x8 back to the switches.

Edit: I see now some slots have two connections. Maybe if you haven’t tried already, experimentation with slot placement of both cards could be worthwhile.

Try Slot 1 with Slot 3, or Slot 5 with Slot 7. If using 1 and 3, make sure 2 is empty. For 5 and 7, make sure 4 and 6 are empty. Otherwise, the two slots you want may not both run in full x16 mode.

JOJOTOOL · December 21, 2025, 7:09am

Hi rs277,

I have fantastic news! The issue is finally resolved, and I’ve managed to break the 38 Gbps barrier.

The Root Cause: It turned out to be the ACS (Access Control Services) configuration on the onboard PLX PEX8747 switches. Even though the T4 and BlueField-2 were physically connected to the same PLX chip (Slot 1 and Slot 3), I discovered via lspci that the BIOS enables ACS (SrcValid+ and UpstreamFwd+) on the PLX downstream ports by default.
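For anyone who wants to run the same check, lspci -vvv (run as root) prints an ACSCtl: line for each bridge that exposes the Access Control Services capability, with a + or - after every flag. The sketch below parses one such line. The sample string is illustrative only: it uses the two flags named in this post, and the state of the remaining flags is an assumption, not the actual output from this machine.

```python
import re

# Illustrative sample, NOT the real output from the system in this thread.
SAMPLE = ("ACSCtl:\tSrcValid+ TransBlk- ReqRedir- CmpltRedir- "
          "UpstreamFwd+ EgressCtrl- DirectTrans-")


def parse_acsctl(line: str) -> dict:
    """Map each flag on an lspci 'ACSCtl:' line to True (+) or False (-)."""
    _, _, flags = line.partition("ACSCtl:")
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", flags)}


def enabled_flags(line: str) -> list:
    """Return the names of the ACS control flags that are switched on."""
    return [name for name, on in parse_acsctl(line).items() if on]


if __name__ == "__main__":
    print(enabled_flags(SAMPLE))
```

A bridge whose list of enabled flags is empty has ACS control fully cleared; any remaining entries show which ACS features are still active on that port.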

This configuration was effectively blocking internal P2P routing within the PLX chip and forcing all traffic to be redirected up to the CPU Root Complex. This caused the data to traverse the DMI link, which explains exactly why I was capped at ~38-40 Gbps (the DMI 3.0 bandwidth limit).

The Solution: I used setpci to manually disable the ACS bits on all PLX bridges. As soon as I did this, the “detour” to the CPU was removed.
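The exact script is not reproduced in this thread, so the following is only a sketch of the commonly used approach: writing zero to the 16-bit ACS Control Register (offset 0x6 in the ACS extended capability, which setpci can address by the name ECAP_ACS). The bridge addresses are hypothetical placeholders, and the helper merely builds the command; it does not run it.

```python
# ACS Control Register offset inside the ACS extended capability.
ACS_CTRL_OFFSET = 0x06


def acs_disable_command(bdf: str) -> list:
    """Build a setpci invocation that clears the ACS Control Register of one bridge."""
    return ["setpci", "-v", "-s", bdf, f"ECAP_ACS+{ACS_CTRL_OFFSET:#x}.w=0000"]


if __name__ == "__main__":
    # Hypothetical bridge addresses; take the real ones from lspci on the target machine.
    for bdf in ("0000:17:08.0", "0000:17:10.0"):
        print(" ".join(acs_disable_command(bdf)))
```

Running the resulting commands requires root, and the register values are reset on reboot, so the change has to be reapplied at boot if it should persist.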

The Result: After running the script, ib_write_bw instantly jumped from 38 Gbps to ~92.19 Gbps! (Screenshot attached)

Thank you so much for your help regarding the slot topology logic—that was the crucial first step that led me to investigate why the “correct” topology wasn’t behaving as expected. Hopefully, this solution helps others struggling with GPUDirect RDMA on X299 workstations!

rs277 · December 21, 2025, 5:55pm

Thanks. ACS is not something I’ve had to deal with, so I’ve learnt something here.
