How to register Unified Memory for GPUDirect RDMA on POWER9 (AC922) using on-demand paging?

I am having trouble getting GPUDirect RDMA to work on the AC922. The nv_rsync_mem module is loaded at boot time,
before any other NVIDIA clients.
I tried several methods:

  • With memory allocated on the GPU device, I get 39 Gb/s RoCEv2 throughput.

    cuMemAlloc(&d_A, size); // allocate on the device
    mr = ibv_reg_mr(pd, (void *)d_A, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    
  • With Unified Memory, I get 95 Gb/s but no coherency: data written by RDMA is not updated on the GPU, and data migration happens only once, during the first transfer.

    cuMemAllocManaged(&d_A, size, CU_MEM_ATTACH_GLOBAL); // managed (unified) allocation
    mr = ibv_reg_mr(pd, (void *)d_A, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    

In this second case, if I set GPU 0 as the preferred location of the managed region with

    cuMemAdvise(d_A, size, CU_MEM_ADVISE_SET_PREFERRED_LOCATION, 0);

throughput drops back to 39 Gb/s (consistent with the first method).

For comparison, the data arrives correctly on the GPU when I stage the transfer through CPU host memory as an intermediate step.