On implicit synchronization of streams on separate devices

Hello,

I am working on a Jacobi-1D stencil across 2 GPUs (each device gets n/2+1 elements of the array). I originally had a host-synchronized version: launch the computation kernels on both devices, wait for their completion, swap the pointers, do the halo exchange, and repeat this in a loop for t timesteps. To remove the host from this path, I decided to use streams.

Before I show the code, here is a short outline of what I do:
for each timestep:

  • launch computation kernel on device 0
  • launch swap kernel on device 0
  • launch computation kernel on device 1
  • launch swap kernel on device 1
  • issue memcpyPeerAsync for halo exchange on device 0
  • issue memcpyPeerAsync for halo exchange on device 1

Now the thing is: before the memcpy, it seems I do not need to synchronize the streams, and I am full of question marks about how this works without such explicit synchronization. I am aware of implicit synchronization among streams, but that concerned streams on a single device. Here we have two streams, one per device. How are these two streams able to synchronize with each other?
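For comparison, the only explicit way I know to order work across devices without stalling the host would be events, something like the sketch below (not my actual code; the event names d0_done/d1_done are made up, the streams are the ones from my code):

```cpp
// Sketch only: cross-device ordering via events.
// cudaStreamWaitEvent can wait on an event recorded on a *different* device.
cudaEvent_t d0_done, d1_done;  // hypothetical names
cudaSetDevice(0);
cudaEventCreateWithFlags(&d0_done, cudaEventDisableTiming);
cudaSetDevice(1);
cudaEventCreateWithFlags(&d1_done, cudaEventDisableTiming);

// ... inside the timestep loop, after launching the swap kernels:
cudaSetDevice(0);
cudaEventRecord(d0_done, d0_stream);   // "device 0 finished its kernels"
cudaSetDevice(1);
cudaEventRecord(d1_done, d1_stream);   // "device 1 finished its kernels"

// Each stream waits for the *other* device before its peer copy,
// since each copy reads the other device's buffer:
cudaStreamWaitEvent(d0_stream, d1_done, 0);
cudaStreamWaitEvent(d1_stream, d0_done, 0);
```

But since the code apparently runs correctly without any of this, I would like to understand what ordering the runtime is providing on its own.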

I have my code below: (the timestep loop)

for (int t = 0; t < tsteps; ++t) {
    //Independent computation of device 0
    CUDA_CHECK(cudaSetDevice(0));
    compute_kernel<<<grid,block,0,d0_stream>>>(n/2+1, d0_A, d0_B); // issue computation kernel on device 0
    CUDA_GET_LAST_ERR(1);
    swap_kernel<<<grid,block,0,d0_stream>>>(n/2+1,d0_A,d0_B); // issue swap kernel on device 0
    CUDA_GET_LAST_ERR(2);

    // Independent computation of device 1
    CUDA_CHECK(cudaSetDevice(1));
    compute_kernel<<<grid,block,0,d1_stream>>>(n/2+1,d1_A,d1_B); // issue computation kernel on device 1
    CUDA_GET_LAST_ERR(3);
    swap_kernel<<<grid,block,0,d1_stream>>>(n/2+1,d1_A,d1_B); // issue swap kernel on device 1
    CUDA_GET_LAST_ERR(4);

    // Wait for these computations to finish (why does it work without these???)
    //CUDA_CHECK(cudaSetDevice(0));
    //CUDA_CHECK(cudaStreamSynchronize(d0_stream));
    //CUDA_CHECK(cudaSetDevice(1));
    //CUDA_CHECK(cudaStreamSynchronize(d1_stream));

    // Issue halo exchange to the streams
    // cudaMemcpyPeerAsync: dest ptr, dest device, src ptr, src device, count, stream
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMemcpyPeerAsync(&d0_A[n/2], 0, &d1_A[1],     1, sizeof(double), d0_stream));
    CUDA_CHECK(cudaSetDevice(1));
    CUDA_CHECK(cudaMemcpyPeerAsync(&d1_A[0],   1, &d0_A[n/2-1], 0, sizeof(double), d1_stream));
}

Notice that I have commented out the stream-synchronization part. As for correctness, I run the same computation on the CPU and compare the values.
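The CPU reference I compare against is essentially the following (a minimal standalone sketch; the function name jacobi1d_cpu and its signature are just illustrative, not my exact code):

```c
#include <string.h>

/* Jacobi-1D reference: the same update as compute_kernel, followed by the
 * buffer exchange that swap_kernel performs on the device. After tsteps
 * iterations the result is left in the caller's A array. */
static void jacobi1d_cpu(int n, int tsteps, double *A, double *B) {
    double *src = A, *dst = B;
    for (int t = 0; t < tsteps; ++t) {
        for (int i = 1; i < n - 1; ++i)
            dst[i] = 0.33333f * (src[i - 1] + src[i] + src[i + 1]);
        dst[0]     = src[0];      /* boundary values carry over unchanged */
        dst[n - 1] = src[n - 1];
        double *tmp = src; src = dst; dst = tmp;  /* pointer swap */
    }
    if (src != A)  /* odd number of steps: copy result back into A */
        memcpy(A, src, (size_t)n * sizeof(double));
}
```

(Note the 0.33333f literal matches the kernel, so the CPU and GPU results agree bit-for-bit instead of differing by the float-vs-double rounding of the constant.)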
I will also add my kernel code here:

__global__ void compute_kernel(int n, double* A, double* B) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i > 0) && (i < (n-1))) {
        B[i] = 0.33333f * (A[i-1] + A[i] + A[i+1]);
    }
}

__global__ void swap_kernel(int n, double* A, double* B) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double tmp = A[i]; A[i] = B[i]; B[i] = tmp;
    }
}

That is all; thank you in advance for your time and effort. Please let me know if you would like me to provide more information.

Best,
Erhan

Edit: Formatting mistakes

What happens in:

CUDA_GET_LAST_ERR(4);

?

Hello,

I have two macros for error checking:

#define CUDA_CHECK(call) \
    if((call) != cudaSuccess) { \
        cudaError_t err = cudaGetLastError(); \
        printf("CUDA error calling method \""#call"\" - err: %s\n", cudaGetErrorString(err)); \
    }

#define CUDA_GET_LAST_ERR(verbose_num) \
    GLOBAL_ERR = cudaGetLastError(); \
    if ( cudaSuccess != GLOBAL_ERR ){ \
        fprintf(stderr, "%d-cudaCheckError() failed: %s\n", verbose_num, cudaGetErrorString(GLOBAL_ERR)); \
    }

So in this case, if there was an error, we would know it happened at the macro call where we passed 4 as the parameter.
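As an aside, a variant of CUDA_CHECK that captures the call's own return value (instead of querying cudaGetLastError afterwards) might look like this sketch:

```cpp
// Sketch only: check the call's return code directly; the do/while(0)
// wrapper makes the macro safe inside an if/else without braces.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error calling \"" #call "\": %s\n", \
                    cudaGetErrorString(err_));                        \
        }                                                             \
    } while (0)
```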