Slow bandwidth using NVIDIA P6000 GPUDirect. Is there hidden handshaking between ConnectX-5 Ethernet / RoCE?

Dear all,

I am doing data transfers between a workstation and one NVIDIA GPU, using a RoCEv2 UD queue pair with SEND/RECEIVE verbs. Hardware: 6-core x86_64, ConnectX-5 over fiber with a direct link (no switch, workstation to workstation), NVIDIA Quadro P6000.

I do large transfers: 4096-byte buffers, a work request list of 4096*2 entries, iterated 1000 times.

  • sender to receiver using memory backed by huge pages (no GPU): 97.4 Gb/s sustained. OK
  • sender to receiver using GPU memory with the nv_peer_mem kernel module: bandwidth starts around 70 Gb/s, OK, BUT then falls slowly (over a couple of seconds) to 20 Gb/s. BAD!!!
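For scale, the volume implied by these parameters can be worked out in a quick sketch (the 97.4 and 20 Gb/s figures are the measurements above):

```python
# Transfer parameters from the benchmark above.
buf_bytes = 4096
wr_per_list = 4096 * 2
iterations = 1000

total_bytes = buf_bytes * wr_per_list * iterations
total_gbits = total_bytes * 8 / 1e9

print(f"total transferred: {total_gbits:.1f} Gb")  # ~268.4 Gb per run
print(f"time at 97.4 Gb/s: {total_gbits / 97.4:.2f} s")
print(f"time at 20 Gb/s:   {total_gbits / 20:.2f} s")
```

So one run moves roughly 33.5 GB; at the degraded 20 Gb/s rate it takes nearly five times as long as at line rate.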

There are no packet drops (verified live), but throughput decreases. Nothing visible in Wireshark, with or without the sniffer.

Some remarks :

  • sender alone, without the receiver running: 97.4 Gb/s
  • sender alone, without the receiver, but with the sniffer enabled on the RX-side ConnectX-5: 75 Gb/s. This fact leads me to think the TX NIC has discovered the state of the receiver NIC.

Any idea about this issue?

Best regards

Actually, the slowdown is related to Global Pause (802.3x flow control), which is activated by default.
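For anyone hitting the same symptom, the pause-frame state can be inspected and disabled per interface with ethtool. This is a config-style sketch; the interface name enp1s0f0 is an assumption, substitute your ConnectX-5 port:

```shell
# Interface name is an example; replace with your ConnectX-5 port.
IFACE=enp1s0f0

# Show current pause-frame (Global Pause) settings.
ethtool -a "$IFACE"

# Disable Global Pause in both directions.
ethtool -A "$IFACE" rx off tx off
```

Note that with pause disabled, any receiver-side backpressure shows up as packet drops instead of throttling.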

The question remains: why is writing to GPU device memory so slow?

The NVIDIA Quadro P6000 is a PCIe Gen3 x16 device, and we have measured transfers up to 100 Gb/s using cudaMemcpy from pinned host memory (which uses the GPU's DMA engine).

Our workstation has 48 PCIe lanes.
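As a sanity check on the link budget (a sketch; the lane counts are from the text above, the per-lane figures are the standard PCIe Gen3 numbers):

```python
# PCIe Gen3: 8 GT/s per lane with 128b/130b encoding.
gen3_gt_per_lane = 8.0
encoding = 128 / 130

# Usable one-direction bandwidth of a full x16 slot, in Gb/s.
x16_gbps = gen3_gt_per_lane * 16 * encoding
print(f"PCIe Gen3 x16: {x16_gbps:.1f} Gb/s")  # ~126.0 Gb/s

# NIC (x16) + GPU (x16) use 32 of the workstation's 48 lanes,
# so both devices can in principle hold full x16 links.
lanes_used = 16 + 16
assert lanes_used <= 48
```

So neither the slot width nor the total lane count explains a 20 Gb/s ceiling; the bottleneck must be elsewhere (e.g. the PCIe topology between the NIC and the GPU).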