I am doing RDMA data transfers between a workstation and an NVIDIA GPU, using a RoCEv2 UD queue pair with SEND/RECEIVE verbs. Hardware: 8/16-core x86_64, Mellanox ConnectX-5 100 Gb/s NICs over a direct fiber link (no switch, workstation to workstation), NVIDIA Quadro P6000 on the same NUMA node as the NIC.
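For context, the UD QP is created roughly like this (a minimal sketch; the device/PD/CQ setup is omitted and the function name is illustrative, but the capacities mirror the numbers below):

```c
#include <infiniband/verbs.h>

/* Minimal sketch of the UD QP creation; pd and cq come from the usual
 * ibv_open_device()/ibv_alloc_pd()/ibv_create_cq() setup (omitted). */
struct ibv_qp *create_ud_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 4096 * 2,  /* matches the work-request count below */
            .max_recv_wr  = 4096 * 2,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_UD,         /* unreliable datagram, runs over RoCEv2 */
    };
    return ibv_create_qp(pd, &attr);
}
```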
I do large transfers: 4096-byte buffers, 4096*2 work requests, iterated 1000 times (the send loop is sketched after the results):
- sender to receiver using hugepage-backed host memory (no GPU): 97.4 Gb/s sustained, OK
- sender to receiver using GPU memory mapped with the nv_peer_mem kernel module (registration sketched below): bandwidth starts around 70 Gb/s, OK, BUT then falls slowly (over a couple of seconds) to 20 Gb/s, BAD!!!
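The GPU path registers the CUDA allocation directly with the verbs stack, roughly like this (a sketch assuming nv_peer_mem is loaded; the function name and flags are illustrative):

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Sketch: register device memory for RDMA. With nv_peer_mem loaded,
 * ibv_reg_mr() pins the GPU pages through the peer-memory interface
 * instead of going through get_user_pages(). */
struct ibv_mr *reg_gpu_buf(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return NULL;
    return ibv_reg_mr(pd, gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);
}
```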
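Each of the 1000 iterations posts the 4096*2 SEND work requests and then drains the CQ, roughly like this (a sketch; ah, remote_qpn, and qkey are placeholders from the connection setup):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

#define NUM_WRS (4096 * 2)
#define BUF_LEN 4096

/* Sketch of one iteration: post NUM_WRS signaled SENDs of BUF_LEN bytes,
 * then poll until every completion has been reaped. Assumes the send
 * queue and CQ were sized for NUM_WRS outstanding entries. */
void run_iteration(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                   struct ibv_ah *ah, uint32_t remote_qpn, uint32_t qkey)
{
    for (int i = 0; i < NUM_WRS; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = BUF_LEN,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = i,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.ud.ah          = ah;          /* address handle of the peer */
        wr.wr.ud.remote_qpn  = remote_qpn;
        wr.wr.ud.remote_qkey = qkey;
        struct ibv_send_wr *bad;
        ibv_post_send(qp, &wr, &bad);
    }
    /* Drain all completions before starting the next iteration. */
    struct ibv_wc wc;
    for (int done = 0; done < NUM_WRS; )
        done += ibv_poll_cq(cq, 1, &wc);
}
```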
There are no packet drops (verified live), yet throughput still decreases. Nothing suspicious shows up in Wireshark, and the behavior is the same with or without the sniffer attached.
On this workstation, the CUDA bandwidthTest sample using pinned host memory reports 12 GB/s.
Any ideas?