I am having trouble getting GPUDirect RDMA to work on the AC922. nv_rsync_mem is loaded at boot time, before the other nvidia clients.
I tried different methods:
- Allocating memory directly on the GPU device, I get 39 Gb/s RoCEv2 throughput:

  cuMemAlloc(&d_A, size);  // allocation on the device
  mr = ibv_reg_mr(pd, (void *)d_A, size,
                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
- Using Unified Memory, I get 95 Gb/s but no coherency (the RDMA data are not updated on the GPU; data migration happens only once, during the first transfer):

  cuMemAllocManaged(&d_A, size, CU_MEM_ATTACH_GLOBAL);  // Unified Memory
  mr = ibv_reg_mr(pd, (void *)d_A, size,
                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  In this case, if I pin the memory region to the GPU with

  cuMemAdvise(d_A, size, CU_MEM_ADVISE_SET_PREFERRED_LOCATION, 0);

  I am back to 39 Gb/s (consistent with the first method).
- I guess that using Unified Memory requires On-Demand Paging (ODP), following the slides at http://on-demand.gputechconf.com/gtc/2018/presentation/s8474-gpudirect-life-in-the-fast-lane.pdf. But with ODP enabled, the HCA no longer receives the RDMA packets:

  cuMemAllocManaged(&d_A, size, CU_MEM_ATTACH_GLOBAL);  // Unified Memory
  mr = ibv_reg_mr(pd, 0, SIZE_MAX,
                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_ON_DEMAND);
By the way, the observed results are correct when I add an intermediate staging step through CPU host memory.
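For completeness, the staging path that does work looks roughly like this (a sketch in the same style as the snippets above; h_A is a hypothetical pinned host buffer):

  cuMemHostAlloc(&h_A, size, 0);   // pinned host staging buffer
  cuMemAlloc(&d_A, size);          // device buffer
  mr = ibv_reg_mr(pd, h_A, size,
                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  /* ... after each incoming RDMA write completes ... */
  cuMemcpyHtoD(d_A, h_A, size);    // explicit copy keeps the GPU coherent

The explicit cuMemcpyHtoD after each transfer is what restores coherency here, at the cost of the extra hop through host memory.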