I am doing data transfer between a workstation and one NVIDIA GPU. I am using a RoCEv2 UD queue pair with SEND/RECEIVE verbs. Hardware: 6-core x86_64, ConnectX-5 over a direct fiber link (no switch, workstation to workstation), NVIDIA Quadro P6000.
I do a large transfer: 4096-byte buffers, a work request list of 4096*2 entries, iterated 1000 times.
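For scale, the numbers above work out as follows (a quick sanity check, assuming every one of the 4096*2 work requests carries a full 4096-byte payload and counting payload only, not RoCEv2/UDP/IP/Ethernet header overhead):

```python
BUF_BYTES = 4096          # payload per work request
WRS_PER_BATCH = 4096 * 2  # work requests in one posted list
ITERATIONS = 1000         # number of times the list is reposted

batch_bytes = BUF_BYTES * WRS_PER_BATCH    # 32 MiB per posted list
total_bits = batch_bytes * ITERATIONS * 8  # whole run, in bits

print(f"per batch: {batch_bytes} bytes ({batch_bytes / 2**20:.0f} MiB)")

# Wall-clock time for the whole run at each observed rate
for gbps in (97.4, 70.0, 20.0):
    secs = total_bits / (gbps * 1e9)
    print(f"{gbps:5.1f} Gb/s -> {secs:.2f} s")
```

So one posted list moves 32 MiB, and the full run is about 268 Gbit; the drop from ~97 Gb/s to ~20 Gb/s stretches the run from under 3 s to over 13 s, which matches the "couple of seconds" timescale of the decay.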
- sender to receiver using host memory backed by hugepages (no GPU): 97.4 Gb/s sustained. OK.
- sender to receiver using GPU memory via the nv_peer_mem kernel module: bandwidth starts around 70 Gb/s (OK), but then falls slowly, over a couple of seconds, to 20 Gb/s. BAD!
There are no packet drops (actively verified), yet throughput decreases. Nothing visible in Wireshark, with or without the sniffer.
Some remarks:
- sender alone, without the receiver running: 97.4 Gb/s
- sender alone, without the receiver but with the sniffer enabled on the RX-side ConnectX-5: 75 Gb/s. This makes me think the TX NIC has discovered the state of the receiver NIC.
Any ideas on this issue?