How to synchronize between two kernels using CUDA?

I currently have two kernels running in different CUDA streams, as follows:

#include <cuda_runtime.h>

__global__ void kernel1(int *transfer_array)
{
    // Instruction a: Store data into global memory `transfer_array`
    // Instruction b
}

__global__ void kernel2(int *transfer_array)
{
    // Instruction c
    // Instruction d: Read the data stored by instruction `a` in global memory `transfer_array`
}

int main()
{
    cudaStream_t stream1, stream2;
    int *transfer_array;
    cudaMalloc((void **)&transfer_array, 5 * sizeof(int));
    
    // Create two CUDA streams
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    
    // Launch kernel1 on stream1
    kernel1<<<100, 1000, 0, stream1>>>(transfer_array);
    // Launch kernel2 on stream2
    kernel2<<<100, 1000, 0, stream2>>>(transfer_array);

    // Synchronize streams to ensure kernels have finished executing
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Destroy streams
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(transfer_array);
    return 0;
}

In kernel2, instruction d needs to read the data that instruction a in kernel1 writes to the global array transfer_array. In other words, instruction d may only execute after every block of kernel1 has completed instruction a.


My Attempts

  1. Using __threadfence() before instruction d: I found that it does not work in this case (a fence only orders memory accesses; it does not wait for another kernel's writes).

  2. Using atomic operations or semaphores for synchronization: This also did not work reliably and sometimes caused deadlocks.

  3. Using cudaDeviceSynchronize() after kernel1 (sketched below): This ensures kernel1 finishes completely before kernel2 starts. However, it forces full serialization of the two kernels, defeating the purpose of using CUDA streams for potential parallelism.
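
For reference, attempt 3 amounts to this sketch (reusing the launch parameters from the code above):

    kernel1<<<100, 1000, 0, stream1>>>(transfer_array);
    cudaDeviceSynchronize();  // the host blocks until kernel1 (and all prior device work) finishes
    kernel2<<<100, 1000, 0, stream2>>>(transfer_array);

This is correct but rules out any overlap between the two kernels.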

Due to certain reliability requirements, these two kernels must run separately and cannot be merged into a single kernel.


My Question

Is there an effective way to achieve synchronization between kernels while still allowing for potential overlap in execution (i.e., allowing kernel1 and kernel2 to execute concurrently where possible)? Could you provide an example? Thank you!

There is no guarantee that two kernels submitted to different streams are executed concurrently, and your code must not depend on it.

You could split the kernels into kernel1_a, kernel1_b, kernel2_c, and kernel2_d, record a CUDA event on stream1 once kernel1_a has finished, and use cudaStreamWaitEvent to make stream2 wait for that event before kernel2_d. This expresses exactly the dependency between instruction a and instruction d while leaving everything else free to overlap.
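
A minimal sketch of that approach (the kernel bodies and the exact split points are placeholders; launch parameters are kept from your question):

#include <cuda_runtime.h>

__global__ void kernel1_a(int *transfer_array)
{
    // Instruction a: store data into global memory `transfer_array`
}

__global__ void kernel1_b(int *transfer_array)
{
    // Instruction b
}

__global__ void kernel2_c(int *transfer_array)
{
    // Instruction c
}

__global__ void kernel2_d(int *transfer_array)
{
    // Instruction d: read the data stored by instruction `a`
}

int main()
{
    cudaStream_t stream1, stream2;
    cudaEvent_t a_done;
    int *transfer_array;
    cudaMalloc((void **)&transfer_array, 5 * sizeof(int));

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    // Timing is not needed here; events created without timing are cheaper to record
    cudaEventCreateWithFlags(&a_done, cudaEventDisableTiming);

    kernel1_a<<<100, 1000, 0, stream1>>>(transfer_array);
    cudaEventRecord(a_done, stream1);         // completes once every block of kernel1_a is done
    kernel1_b<<<100, 1000, 0, stream1>>>(transfer_array);

    kernel2_c<<<100, 1000, 0, stream2>>>(transfer_array);
    cudaStreamWaitEvent(stream2, a_done, 0);  // stream2 waits on the device; the host does not block
    kernel2_d<<<100, 1000, 0, stream2>>>(transfer_array);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaEventDestroy(a_done);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(transfer_array);
    return 0;
}

With this ordering, kernel2_c may overlap kernel1_a and kernel1_b, and kernel1_b may overlap kernel2_d; only kernel2_d is forced to wait until instruction a has completed in every block, which is exactly the dependency you described.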

Even if it could somehow be done with flags in global device memory, would there be any guarantee that all of kernel1's writes become visible to the other kernel in the correct order?