I am currently performing asynchronous operations using five streams in CUDA.
I want each stream's work to complete in a specific order: memcpy (host side) → cudaMemcpy (Host to Device) → kernel → cudaMemcpy (Device to Host) → memcpy (host side). The problem is that the cudaMemcpy (Device to Host) must be complete before I can safely access the destination buffer in the final host-side memcpy. I tried using stream synchronization, but it breaks after one cycle.
Is there a way to wait on each stream individually, while keeping the streams asynchronous with respect to each other, so that the cudaMemcpy (Device to Host) is guaranteed to have finished before the CPU performs the final memcpy?
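To make the question concrete, here is a minimal sketch of the pattern I mean. The buffer names, sizes, and the kernel are placeholders, not my real code; the pinned staging buffers are there because cudaMemcpyAsync only overlaps with pinned host memory:

```cuda
#include <cuda_runtime.h>
#include <cstring>

#define NSTREAMS 5
#define N (1 << 20)

// placeholder kernel
__global__ void kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    cudaStream_t streams[NSTREAMS];
    float *h_src[NSTREAMS], *h_pinned[NSTREAMS], *h_dst[NSTREAMS], *d_buf[NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        h_src[s] = new float[N];
        h_dst[s] = new float[N];
        cudaMallocHost(&h_pinned[s], N * sizeof(float)); // pinned staging buffer
        cudaMalloc(&d_buf[s], N * sizeof(float));
    }

    for (int s = 0; s < NSTREAMS; ++s) {
        // 1. host memcpy into the pinned staging buffer
        memcpy(h_pinned[s], h_src[s], N * sizeof(float));
        // 2. H2D copy, 3. kernel, 4. D2H copy — all async on this stream
        cudaMemcpyAsync(d_buf[s], h_pinned[s], N * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        kernel<<<(N + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], N);
        cudaMemcpyAsync(h_pinned[s], d_buf[s], N * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < NSTREAMS; ++s) {
        // 5. this is where it goes wrong for me: I want to wait on THIS
        // stream only, then do the final host memcpy
        cudaStreamSynchronize(streams[s]);
        memcpy(h_dst[s], h_pinned[s], N * sizeof(float));
    }
    return 0;
}
```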
The following is the Nsight timeline captured while the code misbehaves.
I’ve tried almost everything suggested there, but the issue persists. Asynchronous execution itself seems to work, but at the point of stream synchronization it looks as if the GPU is waiting for all streams to complete. Should each stream live in its own host thread?
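One alternative I'm considering instead of threads is letting the driver run the final memcpy via cudaLaunchHostFunc (CUDA 10+), so the CPU never blocks on the whole device. This is only a sketch of the idea; `CopyArgs`, `hostCopy`, and `args` are names I made up:

```cuda
#include <cuda_runtime.h>
#include <cstring>

struct CopyArgs { float *dst, *src; size_t bytes; };

// The driver invokes this on the stream once all prior work
// (including the D2H copy) on that stream has completed.
void CUDART_CB hostCopy(void *userData) {
    CopyArgs *a = static_cast<CopyArgs *>(userData);
    memcpy(a->dst, a->src, a->bytes);
}

// per stream, after enqueuing the D2H copy:
//   cudaLaunchHostFunc(streams[s], hostCopy, &args[s]);
// args[s] must stay alive until the callback runs.
```

Would this be the right way to keep the five streams independent, or is a dedicated host thread per stream the usual pattern?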
I’m really curious.