In my setup I have 8 processes running with 1 GPU each.
- each process allocates a chunk of device memory, and creates a handle with cudaIpcGetMemHandle
- rank 0 receives handles from the 7 other processes and obtains 7 pointers via cudaIpcOpenMemHandle
- rank 0 issues a cudaMemcpyAsync on a different stream for each of the 7 peers to fetch the contents of their memory
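For reference, the copy loop on rank 0 looks roughly like this (error checking omitted; `handles[i]`, `dst[i]`, and `bytes` are illustrative names — the handles are assumed to have been received from the other ranks over some IPC channel such as MPI or sockets):

```cuda
#include <cuda_runtime.h>

// Sketch of rank 0's side: open each peer's buffer and launch
// one async copy per peer on its own stream.
cudaStream_t streams[7];
void *peerPtrs[7];
for (int i = 0; i < 7; ++i) {
    cudaStreamCreate(&streams[i]);
    cudaIpcOpenMemHandle(&peerPtrs[i], handles[i],
                         cudaIpcMemLazyEnablePeerAccess);
    // cudaMemcpyDefault lets the runtime infer the direction via UVA
    cudaMemcpyAsync(dst[i], peerPtrs[i], bytes,
                    cudaMemcpyDefault, streams[i]);
}
for (int i = 0; i < 7; ++i)
    cudaStreamSynchronize(streams[i]);
```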
I would assume these 7 memcpy calls could overlap since they are on separate streams. However, I am not getting any performance gain compared to 7 blocking cudaMemcpy calls, and an Nsys trace shows they are still fully serialized (although I can see the distinct streams, and the copies complete out of order), as shown in the screenshot below:
As a control, I changed the source pointers from the peer buffers of the other GPU processes to a device buffer allocated locally; in that case the copies do overlap and show a considerable performance gain.
Is there anything preventing p2p memcpy calls from happening in parallel? I couldn't find any documentation on this, so I would appreciate some help. Thanks!