GPUDirect RDMA Bandwidth Bottleneck (~38 Gbps) on ASUS WS X299 SAGE/10G with Tesla T4 + BlueField-2

Hi everyone,

I am trying to get GPUDirect RDMA (GDR) working between two identical machines, but I am hitting a hard bandwidth ceiling of around 32-38 Gbps, despite the hardware theoretically supporting 100 Gbps.
Hardware Setup (Per Node):

  • Motherboard: ASUS WS X299 SAGE/10G

  • CPU: Intel Core i9-10980XE @ 3.00 GHz

  • GPU: NVIDIA Tesla T4

  • NIC/DPU: NVIDIA BlueField-2 (configured as Ethernet/RoCE)

  • OS: Ubuntu 22.04 LTS

  • Driver: NVIDIA Driver 550 (Production Branch)

  • CUDA: 12.3

  • Mellanox OFED Driver: OFED-internal-25.07-0.9.7

The Issue: When running ib_write_bw with GPUDirect enabled, the bandwidth is capped at ~38 Gbps. However, checking the individual components confirms that each of them is capable of full speed (the commands I used are sketched after the list below):

  1. PCIe Link Status: Both the GPU and the DPU report a negotiated link of Gen3 x16 (8 GT/s) in lspci.

  2. Pure NIC Performance: A standard RDMA test (ib_write_bw without CUDA) reaches ~96 Gbps, confirming that the network link itself is healthy.

  3. Host-to-Device Bandwidth: The CUDA bandwidthTest sample (pinned memory) shows ~12 GB/s, confirming that the GPU's PCIe link is running at full x16 speed.

  4. Topology: nvidia-smi topo -m reports PIX between the T4 and the BlueField-2, i.e. the connection traverses at most a single PCIe bridge, so both devices sit behind the same PCIe switch.
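
For reference, this is roughly how I ran those checks; the RDMA device name (mlx5_0), the PCI addresses, and the GPU index are from my setup, so substitute your own:

```
# GPUDirect run (capped at ~38 Gbps) -- requires perftest built with CUDA support
ib_write_bw -d mlx5_0 -a --report_gbits --use_cuda=0              # server side
ib_write_bw -d mlx5_0 -a --report_gbits --use_cuda=0 <server_ip>  # client side

# Host-memory run for comparison (~96 Gbps)
ib_write_bw -d mlx5_0 -a --report_gbits

# Negotiated PCIe speed/width of the T4 and the BlueField-2
sudo lspci -s <gpu_bdf> -vvv | grep LnkSta:
sudo lspci -s <nic_bdf> -vvv | grep LnkSta:

# Host <-> device bandwidth (CUDA samples) and PCIe topology
./bandwidthTest --memory=pinned
nvidia-smi topo -m
```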

Troubleshooting Steps Taken:

  • Enabled Above 4G Decoding and disabled Secure Boot in BIOS.

  • Enabled the ACS override in GRUB: pcie_acs_override=downstream,multifunction (kernel command line sketched after this list).

  • Set MTU to 9000 (Jumbo Frames) on both interfaces.

  • Verified that the issue persists in both loopback and node-to-node tests.
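
For completeness, the GRUB and MTU changes were essentially the following; the interface name is a placeholder, and as far as I can tell the pcie_acs_override parameter only takes effect on kernels carrying the out-of-tree ACS override patch, which may be why it made no difference here:

```
# /etc/default/grub -- append to the existing GRUB_CMDLINE_LINUX_DEFAULT,
# then run: sudo update-grub && sudo reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_acs_override=downstream,multifunction"

# Jumbo frames on the BlueField-2 interface (name is a placeholder)
sudo ip link set dev <bf2_iface> mtu 9000
```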

Hypothesis: Given that the ASUS WS X299 SAGE uses PLX switches to expand its PCIe lanes, I suspect the bottleneck lies either in the P2P routing path through the PLX chips or in a DMI limitation (which would cap throughput at roughly PCIe Gen3 x4 speeds), even though the reported link width is x16.
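
For anyone who wants to check the same thing on this board, the lspci tree view makes it easy to see whether the T4 and the BlueField-2 really hang off the same PLX upstream port (vendor ID 10b5 is PLX/Broadcom):

```
# PCIe tree -- the T4 and the BlueField-2 should appear under the same
# PEX 8747 upstream port if P2P is supposed to stay inside the switch
lspci -tv

# List the PLX bridges themselves (vendor ID 10b5)
lspci -d 10b5: -nn
```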

Has anyone experienced similar P2P performance issues on X299 platforms with PLX chips? Are there specific BIOS settings or slot configurations recommended for this motherboard to enable full P2P throughput (~100 Gbps)?

Thanks.

Update: I have fantastic news! The issue is finally resolved, and I’ve managed to break the 38 Gbps barrier.

The Root Cause: It turned out to be the ACS (Access Control Services) configuration on the onboard PLX PEX8747 switches. Even though the T4 and BlueField-2 were physically connected to the same PLX chip (Slot 1 and Slot 3), I discovered via lspci that the BIOS enables ACS (SrcValid+ and UpstreamFwd+) on the PLX downstream ports by default.
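
This is roughly how the ACS state showed up; the bus:device.function below is a placeholder, so point it at the downstream ports of your own PLX switch:

```
# Human-readable ACS capability of a PLX downstream port
sudo lspci -s <plx_downstream_bdf> -vvv | grep -A2 "Access Control Services"

# Raw ACS Control register (offset 0x6 in the ACS extended capability)
sudo setpci -s <plx_downstream_bdf> ECAP_ACS+0x6.w
```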

This configuration was effectively blocking internal P2P routing within the PLX chip and forcing all traffic to be redirected up to the CPU root complex. That detour sent the data across the DMI link, which would explain why I was capped at ~38-40 Gbps (roughly the DMI 3.0 bandwidth limit).

The Solution: I used setpci to manually disable the ACS bits on all PLX bridges. As soon as I did this, the “detour” to the CPU was removed.
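
The script was essentially the usual "clear ACS on every bridge" loop; a sketch is below. Note that it clears the ACS Control register on every device that exposes the capability (not just the PLX downstream ports), it needs root, and the change does not survive a reboot, so it has to be reapplied (e.g. from a systemd unit) after each boot:

```
#!/bin/bash
# Clear the ACS Control register (ECAP_ACS + 0x6) on every PCI device
# that exposes an ACS capability, so P2P TLPs are no longer forced
# upstream to the root complex.
for BDF in $(lspci | awk '{print $1}'); do
    # Skip devices without an ACS capability
    if ! setpci -s "$BDF" ECAP_ACS+0x6.w > /dev/null 2>&1; then
        continue
    fi
    echo "Disabling ACS on $BDF"
    setpci -s "$BDF" ECAP_ACS+0x6.w=0000
done
```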

The Result: After running the script, ib_write_bw instantly jumped from 38 Gbps to ~92.19 Gbps! (Screenshot attached)