Hello,
I am working on a Jacobi-1D stencil with 2 GPU’s (where each device gets n/2+1 of the array). I originally had a host synchronized code, where I launch the computation kernels on both devices, wait for their completion, swap the pointers, do the halo exchange and repeat this in a loop (for t timesteps). To remove the host from this path, I have decided to use streams.
Before I show the code, here is what I do shortly:
for each timestep:
- launch computation kernel on device 0
- launch swap kernel on device 0
- launch computation kernel on device 1
- launch swap kernel on device 1
- issue memcpyPeerAsync for halo exchange on device 0
- issue memcpyPeerAsync for halo exchange on device 1
Now the thing is, before the memcpy, it seems that I do not require to synchronize the streams, I am full of question marks regarding how come this works without such explicit synchronization. I am aware of the implicit synchronization among streams, but they were on different streams in a single device. In this case, we have 2 streams, one stream per device. How are these two streams able to synchronize?
I have my code below: (the timestep loop)
for (int t = 0; t < tsteps; ++t) {
//Independent computation of device 0
CUDA_CHECK(cudaSetDevice(0));
compute_kernel<<<grid,block,0,d0_stream>>>(n/2+1, d0_A, d0_B); // issue computation kernel on device 0
CUDA_GET_LAST_ERR(1);
swap_kernel<<<grid,block,0,d0_stream>>>(n/2+1,d0_A,d0_B); // issue swap kernel on device 0
CUDA_GET_LAST_ERR(2);
// Independent computation of device 1
CUDA_CHECK(cudaSetDevice(1));
compute_kernel<<<grid,block,0,d1_stream>>>(n/2+1,d1_A,d1_B); // issue computation kernel on device 1
CUDA_GET_LAST_ERR(3);
swap_kernel<<<grid,block,0,d1_stream>>>(n/2+1,d1_A,d1_B); // issue swap kernel on device 1
CUDA_GET_LAST_ERR(4);
// Wait for these computations to finish (why does it work without these???)
//CUDA_CHECK(cudaSetDevice(0));
//CUDA_CHECK(cudaStreamSynchronize(d0_stream));
//CUDA_CHECK(cudaSetDevice(1));
//CUDA_CHECK(cudaStreamSynchronize(d1_stream));
// Issue halo exchange to the streams
// cudaMemcpyPeerAsync: dest ptr, dest device, src ptr, src device, count, stream
CUDA_CHECK(cudaSetDevice(0));
CUDA_CHECK(cudaMemcpyPeerAsync(&d0_A[n/2], 0, &d1_A[1],1, sizeof(double), d0_stream));
CUDA_CHECK(cudaSetDevice(1));
CUDA_CHECK(cudaMemcpyPeerAsync(&d1_A[0], 1, &d0_A[n/2-1], 0, sizeof(double), d1_stream));
}
Notice how I have commented the stream synchronization part. Regarding how I check the correctness, I run the same thing on CPU and compare the values.
I will also add my kernel code here:
__global__ void compute_kernel(int n, double* A, double* B) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if ((i > 0) && (i < (n-1))) {
B[i] = 0.33333f * (A[i-1] + A[i] + A[i+1]);
}
}
__global__ void swap_kernel(int n, double* A, double* B) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
double tmp = A[i]; A[i] = B[i]; B[i] = tmp;
}
}
That is all, thank you for time and effort in advance. Please write so if you want to me give more information.
Best,
Erhan
Edit: Formatting mistakes