Using IPC to move data from one GPU to another

Dear all,

I hope I am writing in the right place.
I am trying to move data from one GPU to another.
(I should point out that I do not want to use any cuMemcpy variant.
In a sense, I am reproducing what happens with cuMemcpyPtoP.)

Using unified addressing, one can write a simple kernel
[…] if ( i < m ) w[i] = v[i]; […]
where v and w are arrays of size m, and i is the appropriate index.
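For concreteness, a minimal version of such a kernel might look like the following (a sketch; the kernel name `copy_kernel` and the element type are my own choices):

```cuda
// Copies m elements from v to w. With unified addressing enabled,
// v and w may be device pointers on two different GPUs.
__global__ void copy_kernel(const double *v, double *w, size_t m)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < m)
        w[i] = v[i];
}
```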

Now consider two devices, dev_0 and dev_1, where the arrays v and w are allocated on dev_0 and dev_1, respectively.
In that case, with the appropriate setup (see below), this kernel works.
But the problem is: how do we notify the target, i.e., dev_1, that the data are actually in its memory?
For that we have used event and wait routines.
Unfortunately, this mechanism does not work: if one prints the data on dev_1, sometimes all the data are read, sometimes only part of them. For example,
b[0] = 0.00000
b[1] = 0.00000
b[2] = 0.00000
b[3] = 0.00000
b[4] = 0.86748
b[5] = 0.83402
b[6] = 0.81273

From this print, we can assume that when dev_1 reads elements 0 to 3, the data are not yet visible, not yet committed, or simply have not arrived. Afterwards the data become visible and can be accessed, so the remaining elements are non-zero.

Now here are my questions:
1/ We record an event associated with the data-moving kernel. When that event has completed, does it mean the data have been sent by dev_0, or that they have been received by dev_1?
2/ How can we rely on the CUDA driver to get an event/notification that the data have actually been written on dev_1?
3/ Any other suggestion?


Here is the setup code that I am using to allow remote memory access:

//On dev_0
Exchange info with device_1

//on dev_1:
Exchange info with device_0
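
The "exchange info" steps above follow the usual IPC pattern with cudaIpcGetMemHandle / cudaIpcOpenMemHandle (this is a sketch of what I mean, not my exact code; error checking is omitted and the transport for the handle is up to the application):

```cuda
// Process owning dev_0: export an IPC handle for its buffer v.
cudaIpcMemHandle_t handle;
cudaSetDevice(0);
cudaIpcGetMemHandle(&handle, v);   // v was allocated with cudaMalloc on dev_0
// ... send `handle` to the other process (e.g., over a socket or MPI) ...

// Process owning dev_1: open the handle to get a usable device pointer.
void *v_remote;
cudaSetDevice(1);
cudaIpcOpenMemHandle(&v_remote, handle, cudaIpcMemLazyEnablePeerAccess);
// ... use v_remote, then cudaIpcCloseMemHandle(v_remote) when done ...
```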

The running code is the following:
//On dev_0
call cuda kernel
call cudaEventRecord

//On dev_1
print the data on the stream
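
Spelled out, the record/wait sequence I am describing is essentially this (a sketch; `stream0`, `stream1`, `print_kernel`, and the launch configuration are placeholder names, and across processes the event would need to be created with the cudaEventInterprocess | cudaEventDisableTiming flags and shared via cudaIpcGetEventHandle / cudaIpcOpenEventHandle):

```cuda
// On dev_0: launch the copy kernel, then record an event on the same stream.
cudaSetDevice(0);
copy_kernel<<<blocks, threads, 0, stream0>>>(v, w_remote, m);
cudaEventRecord(ev, stream0);

// On dev_1: make its stream wait on the event before reading w.
cudaSetDevice(1);
cudaStreamWaitEvent(stream1, ev, 0);
print_kernel<<<1, 1, 0, stream1>>>(w, m);  // prints the b[i] values shown above
```

My question 1/ is precisely about what the completion of `ev` guarantees here with respect to the visibility of w on dev_1.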