Are there resource races between peer-to-peer copies and compute kernels?

In my scenario, I need to issue compute kernels and ring-based peer-to-peer copies between different devices concurrently.

The following is my simplified application logic:

for (int t = 0; t < max_issue_times; ++t) {
    for (int d = 0; d < NUM_DEVICES; ++d) {
        cudaSetDevice(d);
        matrixMultiply(d, compute_stream_[d]);
    }
}

// somewhere else
for (int t = 0; t < max_issue_times; ++t) {
    // ring-based peer-to-peer copy
    for (int d = 0; d < NUM_DEVICES; ++d) {
        int f = (d + 1) % NUM_DEVICES;
        cudaSetDevice(d);
        cudaMemcpyPeerAsync(dst_buff[f], f,
                            src_buff[d], d,
                            buff_size, copy_streams_[d]);
    }
}
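For completeness, here is how the streams might be created. This is a hedged sketch, since the stream setup isn't shown above; if compute_stream_ or copy_streams_ were default (blocking) streams, the legacy default-stream semantics alone could serialize copies against kernels:

```cpp
// Hypothetical stream setup (not part of the original snippet).
// Non-blocking streams avoid implicit synchronization with the
// legacy default stream, which can otherwise serialize work.
cudaStream_t compute_stream_[NUM_DEVICES];
cudaStream_t copy_streams_[NUM_DEVICES];
for (int d = 0; d < NUM_DEVICES; ++d) {
    cudaSetDevice(d);
    cudaStreamCreateWithFlags(&compute_stream_[d], cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&copy_streams_[d],  cudaStreamNonBlocking);
}
```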

Looking at the profiler timeline, I found that many of the P2P copies are forced to run only after all matrix-multiply kernels have finished.
However, the H2D and D2H memcpys overlap with the kernels just fine. (When peer access is disabled between two devices, cudaMemcpyPeerAsync actually performs a D2H copy followed by an H2D copy, staging through host memory.)
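One thing worth verifying is whether peer access is actually enabled for the device pairs that support it. A minimal check-and-enable sketch (error handling omitted; the APIs are cudaDeviceCanAccessPeer and cudaDeviceEnablePeerAccess):

```cpp
// Enable P2P access between every pair of devices that supports it.
for (int d = 0; d < NUM_DEVICES; ++d) {
    cudaSetDevice(d);
    for (int f = 0; f < NUM_DEVICES; ++f) {
        if (f == d) continue;
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, d, f);
        if (can_access) {
            // Allows device d to read/write device f's memory directly,
            // so cudaMemcpyPeerAsync can take the true P2P path.
            cudaDeviceEnablePeerAccess(f, 0);
        }
    }
}
```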

I have 8 Tesla M40 GPUs with P2P connectivity as follows:
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0 0
2 1 1 1 1 0 0 0 0
3 1 1 1 1 0 0 0 0
4 0 0 0 0 1 1 1 1
5 0 0 0 0 1 1 1 1
6 0 0 0 0 1 1 1 1
7 0 0 0 0 1 1 1 1
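The matrix above can be reproduced programmatically with cudaDeviceCanAccessPeer. A small sketch, assuming all 8 devices are visible to the process:

```cpp
// Print a P2P connectivity matrix like the one above.
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
printf("P2P Connectivity Matrix\nD\\D");
for (int j = 0; j < num_devices; ++j) printf(" %d", j);
printf("\n");
for (int i = 0; i < num_devices; ++i) {
    printf("%3d", i);
    for (int j = 0; j < num_devices; ++j) {
        int access = 1;  // a device always "accesses" itself
        if (i != j) cudaDeviceCanAccessPeer(&access, i, j);
        printf(" %d", access);
    }
    printf("\n");
}
```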

My questions: are there resource races between peer-to-peer copies and compute kernels? And why are so many P2P copies forced to run only after all matrix-multiply kernels have finished?