How to implement CUDA IPC efficiently

Hi CUDA experts,

I have a question about implementing CUDA IPC. Let’s say I have two devices on two machines. On each device, I first create a handle for the buffer to be transferred and then open the handle:

cudaIpcGetMemHandle((cudaIpcMemHandle_t *)&shm->memHandle[0], buffer0);
cudaIpcOpenMemHandle(&ptr0, *(cudaIpcMemHandle_t *)&shm->memHandle[0], cudaIpcMemLazyEnablePeerAccess);

I do the same thing on the other device (device 1). My question is how to combine the data behind ptr0 with the data behind ptr1 from device 1, and then store the combined result on device 0.
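
To make the combine step concrete, here is a minimal sketch of what I have in mind. It assumes both GPUs sit in the same machine, since CUDA IPC handles can only be shared between processes on one machine. To keep it self-contained it runs in a single process and allocates both buffers with cudaMalloc; in the IPC setup, ptr1 would instead be the pointer returned by cudaIpcOpenMemHandle in the process that drives device 0. The kernel name `combine`, the elementwise sum, the float element type, and the buffer size are all placeholders I made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Bail out of main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                        \
  do {                                                                     \
    cudaError_t err_ = (call);                                             \
    if (err_ != cudaSuccess) {                                             \
      std::fprintf(stderr, "%s failed: %s\n", #call,                       \
                   cudaGetErrorString(err_));                              \
      return 1;                                                            \
    }                                                                      \
  } while (0)

// Hypothetical combine step: elementwise sum of the two buffers.
__global__ void combine(const float* a, const float* b, float* out, size_t n) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) out[i] = a[i] + b[i];
}

int main() {
  const size_t n = 1 << 20;               // assumed element count
  const size_t bytes = n * sizeof(float);

  int count = 0;
  CHECK(cudaGetDeviceCount(&count));
  const int dev0 = 0;
  const int dev1 = (count > 1) ? 1 : 0;   // fall back to one GPU so the sketch still runs

  std::vector<float> h0(n, 1.0f), h1(n, 2.0f), hout(n);
  float *ptr0 = nullptr, *ptr1 = nullptr, *out = nullptr, *staging = nullptr;

  // Buffer that lives on device 1 (stand-in for the IPC-opened pointer).
  CHECK(cudaSetDevice(dev1));
  CHECK(cudaMalloc(&ptr1, bytes));
  CHECK(cudaMemcpy(ptr1, h1.data(), bytes, cudaMemcpyHostToDevice));

  // Local buffer and result buffer on device 0.
  CHECK(cudaSetDevice(dev0));
  CHECK(cudaMalloc(&ptr0, bytes));
  CHECK(cudaMalloc(&out, bytes));
  CHECK(cudaMemcpy(ptr0, h0.data(), bytes, cudaMemcpyHostToDevice));

  const float* src1 = ptr1;
  if (dev1 != dev0) {
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, dev0, dev1));
    if (canAccess) {
      // Let kernels on device 0 read device 1 memory directly.
      CHECK(cudaDeviceEnablePeerAccess(dev1, 0));
    } else {
      // No peer access: copy device 1's data into a staging buffer on device 0.
      CHECK(cudaMalloc(&staging, bytes));
      CHECK(cudaMemcpyPeer(staging, dev0, ptr1, dev1, bytes));
      src1 = staging;
    }
  }

  // Run the combine on device 0; the result stays in device 0 memory.
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  combine<<<blocks, threads>>>(ptr0, src1, out, n);
  CHECK(cudaGetLastError());

  // Copy back only to inspect the result; this blocking copy waits for the kernel.
  CHECK(cudaMemcpy(hout.data(), out, bytes, cudaMemcpyDeviceToHost));
  std::printf("out[0] = %.1f\n", hout[0]);

  CHECK(cudaFree(staging));               // cudaFree(nullptr) is a no-op
  CHECK(cudaFree(out));
  CHECK(cudaFree(ptr0));
  CHECK(cudaFree(ptr1));
  return 0;
}
```

With peer access enabled, the kernel reads device 1's buffer directly over the interconnect, so no extra copy is needed. The cudaMemcpyPeer path is only a fallback for GPUs that cannot access each other's memory. Is this the right direction, or is there a more efficient way?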