How to register Unified Memory for GPUDirect RDMA on POWER9 (AC922) using On-Demand Paging?

raph38130 · May 14, 2020, 6:48pm

I am having trouble getting GPUDirect RDMA to work on the AC922. nv_rsync_mem is loaded at boot time,
before any other NVIDIA clients.
I tried several methods:

Using memory allocated on the GPU device, I get 39 Gb/s RoCEv2 throughput:

```
cuMemAlloc(&d_A, size); // allocation on device
mr = ibv_reg_mr(pd, (void *)d_A, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
```

Using Unified Memory, I get 95 Gb/s but no coherency: the RDMA data is not updated on the GPU, and data migration happens only once, during the first transfer.
```
cuMemAllocManaged(&d_A, size, CU_MEM_ATTACH_GLOBAL); // Unified Memory
mr = ibv_reg_mr(pd, (void *)d_A, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
```

In this case, if I bind the memory region to the GPU with

```
cuMemAdvise(d_A, size, CU_MEM_ADVISE_SET_PREFERRED_LOCATION, 0); // preferred location: device 0
```

I get 39 Gb/s again (consistent with the first case).
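
One way to make the "not updated" symptom measurable is to stamp every transfer with a sequence number and count the words that do not belong to the expected transfer. The following is only a minimal host-side sketch; the function names `stamp_buffer` and `count_stale_words` are hypothetical, and the same comparison would have to run in a kernel on `d_A` (or on a copy of it) to test what the GPU actually sees.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n 32-bit words so that each word encodes the transfer sequence
 * number (upper 16 bits) and the word index modulo 65536 (lower 16 bits).
 * Stamps of two transfers differ as long as seq stays below 65536. */
void stamp_buffer(uint32_t *buf, size_t n, uint32_t seq)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (seq << 16) | (uint32_t)(i & 0xFFFFu);
}

/* Count the words that do not carry the stamp of transfer `seq`.
 * A result of 0 means the buffer holds exactly that transfer. */
size_t count_stale_words(const uint32_t *buf, size_t n, uint32_t seq)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t expected = (seq << 16) | (uint32_t)(i & 0xFFFFu);
        if (buf[i] != expected)
            stale++;
    }
    return stale;
}
```

The sender would call `stamp_buffer` with an increasing `seq` before each RDMA write. If the receiver still counts 0 stale words for the first sequence number after later transfers have completed, the GPU is reading the pages from the first migration.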

I guess that using Unified Memory requires On-Demand Paging (ODP), following the slides here (http://on-demand.gputechconf.com/gtc/2018/presentation/s8474-gpudirect-life-in-the-fast-lane.pdf). With ODP, however, the HCA no longer receives the RDMA packets:
```
cuMemAllocManaged(&d_A, size, CU_MEM_ATTACH_GLOBAL); // Unified Memory
// implicit ODP: register the whole address space (addr = NULL, length = SIZE_MAX)
mr = ibv_reg_mr(pd, NULL, SIZE_MAX, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_ON_DEMAND);
```
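
A sanity check that may be worth running before the implicit ODP registration is whether the HCA advertises the needed ODP capabilities at all. In libibverbs, `ibv_query_device_ex()` fills a `struct ibv_device_attr_ex` whose `odp_caps.general_caps` and `odp_caps.per_transport_caps.rc_odp_caps` fields carry these bits. The sketch below only decodes two such masks. The helper name is hypothetical, an RC queue pair is assumed, and the three constants are local copies of what I understand `IBV_ODP_SUPPORT`, `IBV_ODP_SUPPORT_IMPLICIT` and `IBV_ODP_SUPPORT_WRITE` to be in `<infiniband/verbs.h>`; real code should use the header's own names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: local copies of bit values from <infiniband/verbs.h>.
 * In real code, include the header and use the IBV_ODP_* constants. */
#define ODP_GENERAL_SUPPORT  (1u << 0) /* IBV_ODP_SUPPORT */
#define ODP_GENERAL_IMPLICIT (1u << 1) /* IBV_ODP_SUPPORT_IMPLICIT */
#define ODP_TRANSPORT_WRITE  (1u << 2) /* IBV_ODP_SUPPORT_WRITE */

/* Return true only if the device reports ODP, implicit ODP, and ODP for
 * RDMA write on the RC transport. The arguments are the raw masks taken
 * from odp_caps.general_caps and odp_caps.per_transport_caps.rc_odp_caps. */
bool implicit_odp_rdma_write_ok(uint64_t general_caps, uint32_t rc_odp_caps)
{
    bool odp      = (general_caps & ODP_GENERAL_SUPPORT) != 0;
    bool implicit = (general_caps & ODP_GENERAL_IMPLICIT) != 0;
    bool rc_write = (rc_odp_caps & ODP_TRANSPORT_WRITE) != 0;
    return odp && implicit && rc_write;
}
```

A `false` result would explain a failing registration or missing traffic on the HCA side. A `true` result only says what the HCA advertises for host memory; it does not show that ODP can resolve pages of managed memory that currently reside on the GPU.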

BTW, the observed results are correct when the data is staged through CPU host memory as an intermediate step.