I think this is a bug, but I will describe the problem I have found so it can be confirmed whether it is a bug or my mistake.
The domain decomposition (DD) used is a ring: device ‘n’ communicates with devices ‘n-1’ and ‘n+1’, and device 0 is connected to device 7 and vice versa.
P2P mode is enabled, BUT… only 4 by 4. That is, there are 2 groups within which P2P can be activated: (0,1,2,3) and (4,5,6,7). Communication between the groups has to go through host memory. From here on, devices 0, 3, 4 and 7 will be called borders.
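To make the topology concrete, here is a minimal host-side sketch of the ring and the two P2P groups. It does not call the CUDA runtime; the names `kNumDevices`, `kGroupSize`, `ring_neighbors`, `p2p_group`, `same_p2p_group` and `is_border` are hypothetical and only model the setup described above.

```cpp
#include <cassert>
#include <utility>

// Hypothetical constants matching the setup described in this post.
constexpr int kNumDevices = 8;  // devices 0..7 arranged in a ring
constexpr int kGroupSize = 4;   // P2P only inside (0,1,2,3) and (4,5,6,7)

// Returns the (previous, next) neighbours of `dev` in the ring.
std::pair<int, int> ring_neighbors(int dev) {
    assert(dev >= 0 && dev < kNumDevices);
    const int prev = (dev + kNumDevices - 1) % kNumDevices;
    const int next = (dev + 1) % kNumDevices;
    return {prev, next};
}

// Index of the P2P group a device belongs to (0 or 1 in this setup).
int p2p_group(int dev) {
    assert(dev >= 0 && dev < kNumDevices);
    return dev / kGroupSize;
}

// True when a direct P2P copy between the two devices is possible.
bool same_p2p_group(int a, int b) {
    return p2p_group(a) == p2p_group(b);
}

// A border device has at least one ring neighbour outside its P2P group,
// so one of its halo transfers has to be staged through host memory.
bool is_border(int dev) {
    const std::pair<int, int> nb = ring_neighbors(dev);
    return !same_p2p_group(dev, nb.first) || !same_p2p_group(dev, nb.second);
}
```

With these definitions, `is_border` is true exactly for devices 0, 3, 4 and 7, which matches the borders named above.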
Each GPU uses 2 streams:
- main = executes the interior volume
- sec = executes the halo volume and performs the (ASYNC) transfer of the halo volume
When I launch the async transfers on the ‘border’ devices, one of the P2P transfers has to be split into 2 transfers: ‘D(i)-H’ and ‘H-D(i+1)’. These 2 transfers are NOT synchronized with each other: ‘D(i)-H’ is scheduled in the stream I indicate (in my case stream sec), but ‘H-D(i+1)’ is scheduled in a new stream created by the system. And … nothing orders one after the other!!!
If an async transfer has to be split into D-H and H-D, BOTH parts should be scheduled in the indicated stream.
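The behaviour I would expect can be sketched as a small host-side model. This is only an illustration of the scheduling I am asking for, not what the runtime does internally and not CUDA API code; `CopyKind`, `CopyStep` and `plan_halo_copy` are hypothetical names, and the stream is represented by a plain label.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the setup in this post: P2P groups of 4 devices.
constexpr int kGroupSize = 4;

enum class CopyKind { PeerToPeer, DeviceToHost, HostToDevice };

// One copy operation queued on a stream. -1 stands for host memory.
struct CopyStep {
    CopyKind kind;
    int src_dev;
    int dst_dev;
    std::string stream;  // label of the stream the step is queued on
};

bool same_p2p_group(int a, int b) {
    return a / kGroupSize == b / kGroupSize;
}

// Expected plan for an async halo copy from device `src` to device `dst`
// requested on `stream`: a single P2P copy inside a group, otherwise two
// staged copies that are BOTH queued on the requested stream, so that
// stream order makes H-D run after D-H.
std::vector<CopyStep> plan_halo_copy(int src, int dst,
                                     const std::string& stream) {
    assert(src >= 0 && dst >= 0);
    std::vector<CopyStep> steps;
    if (same_p2p_group(src, dst)) {
        steps.push_back(CopyStep{CopyKind::PeerToPeer, src, dst, stream});
    } else {
        steps.push_back(CopyStep{CopyKind::DeviceToHost, src, -1, stream});
        steps.push_back(CopyStep{CopyKind::HostToDevice, -1, dst, stream});
    }
    return steps;
}
```

For example, `plan_halo_copy(3, 4, "sec")` yields a D-H step followed by an H-D step, both labelled `sec`, whereas what I observe is the second step landing in a different stream.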
Edit: I have attached a screenshot from nvvp to illustrate the problem.