I want to synchronize CUDA streams

I am currently running asynchronous operations on five CUDA streams.
I want each cycle to execute in a specific order: host memcpy → cudaMemcpy (Host to Device) → kernel → cudaMemcpy (Device to Host) → host memcpy. The problem is that the destination buffer of the cudaMemcpy (Device to Host) must not be touched by the final host memcpy until that copy has completed. I tried stream synchronization, but it breaks after one cycle.
Is there a way to make each stream wait individually, while keeping the overall execution asynchronous, so that the cudaMemcpy (Device to Host) is complete before the CPU performs the final memcpy?

The following is what Nsight shows when profiling the code that exhibits the error.

The general stream method to ensure that operation B does not begin until operation A is complete is to launch them into the same stream.
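For example, a minimal sketch of that pattern (the kernel, sizes, and buffer names below are placeholders, not taken from your code): all three device operations are issued into one stream, so the kernel cannot start before the H2D copy and the D2H copy cannot start before the kernel:

```cpp
#include <cuda_runtime.h>
#include <cstring>

__global__ void myKernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;          // placeholder work
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    float* final_dst = new float[n]();         // ordinary pageable host buffer
    cudaMallocHost(&h_in,  bytes);             // pinned, so the copies can be async
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    memcpy(h_in, final_dst, bytes);                                        // first host memcpy
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);    // H2D
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_out, d_in, n);         // kernel
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);  // D2H

    // Within one stream these execute in issue order. Blocking the host here
    // waits only for this stream; other streams keep running.
    cudaStreamSynchronize(stream);
    memcpy(final_dst, h_out, bytes);           // safe: the D2H copy has completed

    cudaStreamDestroy(stream);
    return 0;
}
```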

You can also look at cudaStreamWaitEvent() as another option.
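A rough sketch of the event-based variant, reusing the placeholder names above: record an event after the D2H copy in one stream, then either make another stream wait on it with cudaStreamWaitEvent(), or make the host wait on it with cudaEventSynchronize():

```cpp
// Make other work wait for the D2H copy issued in streamA, without stalling
// anything that was queued earlier.
cudaEvent_t d2hDone;
cudaEventCreateWithFlags(&d2hDone, cudaEventDisableTiming);

cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, streamA);
cudaEventRecord(d2hDone, streamA);      // marks the point right after the copy

// Anything issued to streamB after this call waits until d2hDone has fired.
cudaStreamWaitEvent(streamB, d2hDone, 0);

// If the CPU itself needs h_out, wait on the event from the host instead:
cudaEventSynchronize(d2hDone);
memcpy(final_dst, h_out, bytes);
```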


I’ve tried almost everything suggested there, but the issue persists. Asynchronous execution seems to work, but at the point of stream synchronization it looks as if the GPU is waiting for all streams to complete. Does each stream need to live in its own host thread?
I’m really curious.
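One point that may help here: cudaStreamSynchronize() blocks the calling host thread only for the stream it is given, not for the whole device (that is what cudaDeviceSynchronize() does), and a single host thread can service several streams, so separate threads are not required. A minimal sketch with placeholder buffer arrays:

```cpp
// One host thread driving five streams; each cudaStreamSynchronize() call
// waits only for the stream passed to it.
const int NUM_STREAMS = 5;
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

for (int i = 0; i < NUM_STREAMS; ++i)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, streams[i]);

for (int i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamSynchronize(streams[i]);        // waits only for stream i
    memcpy(final_dst[i], h_out[i], bytes);    // this stream's buffer is ready
}
```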

I just resolved it using a callback function with cudaLaunchHostFunc()! Thank you very much.
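For anyone landing here later, a minimal sketch of that approach (the struct, callback, and buffer names are assumptions, not the actual code from this thread): cudaLaunchHostFunc() enqueues a host function into the stream, so it runs only after the D2H copy issued before it in the same stream has finished, and the host thread never has to block. Note that a host function must not make CUDA API calls.

```cpp
#include <cuda_runtime.h>
#include <cstring>

struct CopyArgs { void* dst; const void* src; size_t bytes; };

void CUDART_CB hostCopy(void* userData) {
    CopyArgs* a = static_cast<CopyArgs*>(userData);
    std::memcpy(a->dst, a->src, a->bytes);   // the D2H copy has already finished
}

// ... per cycle, for each stream:
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_out, d_in, n);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

static CopyArgs args{final_dst, h_out, bytes};   // must outlive the callback
cudaLaunchHostFunc(stream, hostCopy, &args);     // runs after the D2H copy above
```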
