How to synchronize between two kernels using CUDA?

I currently have two kernels running in different CUDA streams, as follows:

#include <cuda_runtime.h>

__global__ void kernel1(int *transfer_array)
{
    // Instruction a: Store data into global memory `transfer_array`
    // Instruction b
}

__global__ void kernel2(int *transfer_array)
{
    // Instruction c
    // Instruction d: Read the data stored by instruction `a` in global memory `transfer_array`
}

int main()
{
    cudaStream_t stream1, stream2;
    int *transfer_array;
    cudaMalloc((void **)&transfer_array, 5 * sizeof(int));
    
    // Create two CUDA streams
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    
    // Launch kernel1 on stream1
    kernel1<<<100, 1000, 0, stream1>>>(transfer_array);
    // Launch kernel2 on stream2
    kernel2<<<100, 1000, 0, stream2>>>(transfer_array);

    // Synchronize streams to ensure kernels have finished executing
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Destroy streams
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(transfer_array);
    return 0;
}

In kernel2, instruction d needs to read the data that instruction a in kernel1 writes to the global array transfer_array. In other words, instruction d may only execute after every block of kernel1 has completed instruction a.


My Attempts

  1. Using __threadfence() before instruction d: I found that it does not work in this case (a fence only orders memory accesses; it does not wait for another kernel's writes).

  2. Using atomic operations or semaphores for synchronization: This also did not work reliably and sometimes caused deadlocks.

  3. Using cudaDeviceSynchronize() after kernel1 (sketched below): This ensures kernel1 finishes completely before kernel2 starts. However, it forces full serialization of the two kernels, defeating the purpose of using CUDA streams for potential parallelism.
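
For reference, attempt 3 amounts to this sketch (reusing the launch parameters from the code above):

    kernel1<<<100, 1000, 0, stream1>>>(transfer_array);
    cudaDeviceSynchronize();  // the host blocks until kernel1 (and all prior device work) finishes
    kernel2<<<100, 1000, 0, stream2>>>(transfer_array);

This is correct but rules out any overlap between the two kernels.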

Due to certain reliability requirements, these two kernels must run separately and cannot be merged into a single kernel.


My Question

Is there an effective way to achieve synchronization between kernels while still allowing for potential overlap in execution (i.e., allowing kernel1 and kernel2 to execute concurrently where possible)? Could you provide an example? Thank you!

There is no guarantee that two kernels submitted to different streams are executed concurrently, and your code must not depend on it.

You could split the kernels into kernel1_a, kernel1_b, kernel2_c, and kernel2_d, record a CUDA event on stream1 once kernel1_a has finished, and use cudaStreamWaitEvent to make stream2 wait for that event before kernel2_d. This expresses exactly the dependency between instruction a and instruction d while leaving everything else free to overlap.
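
A minimal sketch of that approach (the kernel bodies and the exact split points are placeholders; launch parameters are kept from your question):

#include <cuda_runtime.h>

__global__ void kernel1_a(int *transfer_array)
{
    // Instruction a: store data into global memory `transfer_array`
}

__global__ void kernel1_b(int *transfer_array)
{
    // Instruction b
}

__global__ void kernel2_c(int *transfer_array)
{
    // Instruction c
}

__global__ void kernel2_d(int *transfer_array)
{
    // Instruction d: read the data stored by instruction `a`
}

int main()
{
    cudaStream_t stream1, stream2;
    cudaEvent_t a_done;
    int *transfer_array;
    cudaMalloc((void **)&transfer_array, 5 * sizeof(int));

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    // Timing is not needed here; events created without timing are cheaper to record
    cudaEventCreateWithFlags(&a_done, cudaEventDisableTiming);

    kernel1_a<<<100, 1000, 0, stream1>>>(transfer_array);
    cudaEventRecord(a_done, stream1);         // completes once every block of kernel1_a is done
    kernel1_b<<<100, 1000, 0, stream1>>>(transfer_array);

    kernel2_c<<<100, 1000, 0, stream2>>>(transfer_array);
    cudaStreamWaitEvent(stream2, a_done, 0);  // stream2 waits on the device; the host does not block
    kernel2_d<<<100, 1000, 0, stream2>>>(transfer_array);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaEventDestroy(a_done);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(transfer_array);
    return 0;
}

With this ordering, kernel2_c may overlap kernel1_a and kernel1_b, and kernel1_b may overlap kernel2_d; only kernel2_d is forced to wait until instruction a has completed in every block, which is exactly the dependency you described.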

Even if it could somehow be done with flags in global device memory, would there be any guarantee that all of kernel1's writes become visible to the other kernel in the correct order?