Unable to make p2p cudaMemcpyAsync calls between GPU processes run in parallel

In my setup I have 8 processes running with 1 GPU each.

  1. each process allocates a chunk of device memory and creates a handle for it with cudaIpcGetMemHandle
  2. rank 0 receives the handles from the 7 other processes and obtains 7 device pointers via cudaIpcOpenMemHandle
  3. rank 0 issues a cudaMemcpyAsync from each of the 7 peer pointers, each on its own stream, to fetch the contents of the peers' memory
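
For reference, the copy loop on rank 0 looks roughly like this. This is a minimal sketch: the handle exchange itself (e.g. over MPI or a socket) is not shown, and `localDst` and `chunkBytes` are assumed names for the local destination buffers and per-peer copy size.

```cpp
#include <cuda_runtime.h>

constexpr int kNumPeers = 7;

cudaIpcMemHandle_t handles[kNumPeers];  // received from the 7 peer ranks (exchange not shown)
void *peerPtrs[kNumPeers];              // pointers into the peers' device memory
void *localDst[kNumPeers];              // local destination buffers (allocation not shown)
cudaStream_t streams[kNumPeers];
size_t chunkBytes = /* per-peer copy size */ 0;

for (int i = 0; i < kNumPeers; ++i) {
    // Map the peer's allocation into this process's address space
    cudaIpcOpenMemHandle(&peerPtrs[i], handles[i], cudaIpcMemLazyEnablePeerAccess);
    cudaStreamCreate(&streams[i]);
}

// One async copy per peer, each on its own stream, hoping they overlap
for (int i = 0; i < kNumPeers; ++i) {
    cudaMemcpyAsync(localDst[i], peerPtrs[i], chunkBytes,
                    cudaMemcpyDeviceToDevice, streams[i]);
}
cudaDeviceSynchronize();
```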

I would assume that there should be some sort of parallelization between these 7 memcpy calls thanks to the streams. However, I am not getting any performance gain compared to 7 blocking cudaMemcpy calls, and a trace in Nsight Systems (nsys) shows that the copies are still all serialized (although I can see the different streams, and the copies happen out of order), as shown in the screenshot below:

I also tried changing the source pointers from the peer buffers of the other GPU processes to a device buffer allocated locally; with local sources, the copies do overlap and I see a considerable performance gain.

Is there anything preventing p2p memcpy calls from happening in parallel? I couldn't find any documentation on this, so I would appreciate some help. Thanks.

It’s not clear why you would expect memcpy operations targeting a single GPU to run in parallel. There is only one pipe/connection to that GPU. When one copy is occupying the pipe, there is no reason to think that another copy can run “in parallel”. And even if it could, there is no reason to assume any improvement in performance. I’m defining performance here as the quantity of data transferred per unit time (hint: that definition is bandwidth, and the bandwidth of the pipe is fixed).


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.