Error using P2P and async transfers with 8 K20 GPUs (1 node)

I think it is a bug, but I will describe the problem I have found so you can confirm whether it is a bug or my mistake.

The DD used is a ring: device n communicates with devices n-1 and n+1, and device 0 is connected with device 7 and vice versa.

P2P mode is enabled, BUT only 4 by 4: there are 2 groups within which P2P can be activated, (0,1,2,3) and (4,5,6,7). Communication between the groups has to go through host memory. From here on, the devices at the group boundaries (0, 3, 4, 7) will be called 'borders'.
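As a sketch, enabling P2P within each 4-GPU group could look like this (the helper name and group arrays are my assumptions; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: enable P2P access among the devices of one group.
void enablePeerGroup(const int *devs, int n)
{
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(devs[i]);
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, devs[i], devs[j]);
            if (canAccess)
                cudaDeviceEnablePeerAccess(devs[j], 0);
        }
    }
}

// The two groups within which P2P is possible on this node.
int groupA[4] = {0, 1, 2, 3};
int groupB[4] = {4, 5, 6, 7};
```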

2 streams per GPU:

  • main = computes the interior volume
  • sec = computes the halo volume and performs the (async) halo transfers
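A minimal sketch of the two-stream setup per GPU (the names main/sec are from the post; array sizes and layout are my assumptions):

```cuda
#include <cuda_runtime.h>

cudaStream_t mainStream[8], secStream[8];

void createStreams(int ngpus)
{
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&mainStream[d]);  // interior-volume kernels
        cudaStreamCreate(&secStream[d]);   // halo kernels + async halo copies
    }
}
```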

When I launch the async transfers on the 'border' devices, one of the P2P transfers has to be split into 2 transfers: D(i)→H and H→D(i+1). These 2 transfers are NOT synchronized: the D(i)→H copy is scheduled on the indicated stream (in my case, stream sec), but the H→D(i+1) copy is scheduled on a new, driver-created stream. And they are not synchronized with each other!

If an async transfer has to be split into D→H and H→D, BOTH halves should be scheduled on the indicated stream.
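As a workaround, the border copy can be staged explicitly through pinned host memory, with the cross-device ordering enforced by hand via an event; a sketch under my assumptions (haloBytes, the pointer and stream names, and the event are all illustrative):

```cuda
#include <cuda_runtime.h>

// Sketch: copy a halo from device devSrc to device devDst across the
// P2P group boundary. Both halves are ordered explicitly instead of
// relying on the driver's internal stream for the H->D half.
void borderHaloCopy(int devSrc, int devDst,
                    const void *d_src, void *d_dst,
                    void *h_staging,       // pinned, from cudaHostAlloc
                    size_t haloBytes,
                    cudaStream_t secSrc, cudaStream_t secDst,
                    cudaEvent_t done)      // cudaEventDisableTiming
{
    cudaSetDevice(devSrc);
    cudaMemcpyAsync(h_staging, d_src, haloBytes,
                    cudaMemcpyDeviceToHost, secSrc);
    cudaEventRecord(done, secSrc);         // marks completion of D->H

    cudaSetDevice(devDst);
    cudaStreamWaitEvent(secDst, done, 0);  // H->D waits for D->H
    cudaMemcpyAsync(d_dst, h_staging, haloBytes,
                    cudaMemcpyHostToDevice, secDst);
}
```

cudaStreamWaitEvent works across devices, which is what makes this explicit ordering possible even though the two halves run on streams belonging to different GPUs.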

Edit: I have attached a screenshot from nvvp to illustrate the problem.

I can't create a bug report; the form gives an error. If someone on the staff can help me file it, I would appreciate it.

It's not clear what you need help with, or what is not working. If the problem is as simple as you state (the D→H and H→D halves of a D→D transfer getting out of sync), it should be easy to demonstrate in a very short piece of code. Can you create a simple example?

The problem is that the second copy can be launched before the first one completes. That is a race condition.

If you think the problem would be clearer with an example, I will try to write one today.

OK, it was my fault. I think it is fixed now.