Do I really need to use cudaDeviceSynchronize in this scenario?

Hi,

I have created two streams named stream0 and stream1.
The design of my module is as follows (please ignore the syntax):

loop (iterations) {

    offsetA = 2 * i * (uint64_t) BytesPerStream;
    offsetB = ((2 * i) + 1) * (uint64_t) BytesPerStream;

    cudaMemcpyAsync(..., Host2Dev, stream0);
    cudaMemcpyAsync(..., Host2Dev, stream1);

    /* POSITION 1 */

    pointerA = PointerToFile + offsetA;
    pointerB = PointerToFile + offsetB;

    kernel<<<..., stream0>>>(...);
    kernel<<<..., stream1>>>(...);

    cudaMemcpyAsync(..., Dev2Host, stream0);
    cudaMemcpyAsync(..., Dev2Host, stream1);

    cudaStreamAddCallback(stream0, ..., pointerA);
    cudaStreamAddCallback(stream1, ..., pointerB);

    cudaDeviceSynchronize(); // Why do I need this?
}

The module only works correctly when I call cudaDeviceSynchronize() inside the loop.

So I would like to get my concepts straight here.

Assume I have commented out cudaDeviceSynchronize().

  1. When stream1 has finished executing kernel<<<..., stream1>>>, it will then invoke the callback registered via cudaStreamAddCallback(stream1, …, pointerB), and the host and device can't proceed to the next iteration until that callback finishes, right?

  2. But the behavior of my program contradicts my first statement: the host proceeds to change the pointer values, i.e. pointerA and pointerB at POSITION 1, for the next iteration before cudaStreamAddCallback(stream1, …, pointerB) has finished. I want to modify pointerA and pointerB on every iteration, but I don't want them to change until both callbacks have finished.

The reason I don't want to use cudaDeviceSynchronize is that both streams read and write their own memory locations, and synchronizing the whole device on every iteration will hurt the module's performance.

Edit: I came to realize that the host can proceed to the next iteration because kernel<<<..., stream1>>> is asynchronous with respect to the host thread. So if kernel<<<..., stream1>>> is executing in iteration X, the host can move on to iteration X+1 and change the pointers before the callback from cudaStreamAddCallback(stream1, …, pointerB) is invoked.
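One workaround I'm considering, since cudaStreamAddCallback receives its userData argument by value at the time of the call: give each callback its own heap-allocated payload, so the host loop can recompute the pointers freely without racing the callbacks. A rough sketch (the struct and function names are made up, not my actual code):

#include <cuda_runtime.h>
#include <cstdlib>

struct CallbackData {
    char   *dst;    /* where this iteration's result goes */
    size_t  bytes;  /* how much data the callback should handle */
};

void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    CallbackData *cb = (CallbackData *) userData;
    /* ... consume cb->dst / cb->bytes; CUDA API calls are not allowed inside callbacks ... */
    free(cb);       /* the payload belongs to this iteration alone */
}

/* inside the loop, instead of handing the callback a shared pointer: */
CallbackData *cbB = (CallbackData *) malloc(sizeof(*cbB));
cbB->dst   = PointerToFile + offsetB;
cbB->bytes = BytesPerStream;
cudaStreamAddCallback(stream1, myCallback, cbB, 0);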

Sorry for my late reply.

  1. I haven't tried dynamic parallelism yet. The nature of my project is such that some sections contain complex operations that I thought were better suited to the host; plus, I'm already spilling device registers, so I can't afford more.

  2. I tried your idea of cudaEventRecord (instead of cudaDeviceSynchronize) and it saved me some time (thanks for that).
    My new design looks like this:

loop (iterations) {

    offsetA = 2 * i * (uint64_t) BytesPerStream;
    offsetB = ((2 * i) + 1) * (uint64_t) BytesPerStream;

    cudaMemcpyAsync(..., Host2Dev, stream0);
    cudaMemcpyAsync(..., Host2Dev, stream1);

    /* POSITION 1 */

    pointerA = PointerToFile + offsetA;
    pointerB = PointerToFile + offsetB;

    kernel<<<..., stream0>>>(...);
    kernel<<<..., stream1>>>(...);

    cudaMemcpyAsync(..., Dev2Host, stream0);
    cudaMemcpyAsync(..., Dev2Host, stream1);

    cudaStreamAddCallback(stream0, ..., pointerA);

    HANDLE_ERROR( cudaEventRecord( start, stream1 ) );
    cudaStreamAddCallback(stream1, ..., pointerB);    // recording events around this callback only
    HANDLE_ERROR( cudaEventRecord( stop, stream1 ) );

    HANDLE_ERROR( cudaEventSynchronize( stop ) );
}
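One caveat I'm aware of: cudaEventSynchronize(stop) only waits for stream1, so stream0's callback is not covered by it. If that ever becomes a problem, recording one event per stream and waiting on both would still avoid a full device-wide barrier. A sketch (event names illustrative; HANDLE_ERROR is the same macro as above; an event recorded after a callback completes only after the callback returns, since callbacks block later work in their stream):

cudaEvent_t done0, done1;
HANDLE_ERROR( cudaEventCreateWithFlags( &done0, cudaEventDisableTiming ) );
HANDLE_ERROR( cudaEventCreateWithFlags( &done1, cudaEventDisableTiming ) );

loop (iterations) {

    /* ... copies, kernels, and callbacks as above ... */

    HANDLE_ERROR( cudaEventRecord( done0, stream0 ) ); // fires only after stream0's callback
    HANDLE_ERROR( cudaEventRecord( done1, stream1 ) ); // fires only after stream1's callback

    HANDLE_ERROR( cudaEventSynchronize( done0 ) );
    HANDLE_ERROR( cudaEventSynchronize( done1 ) );
}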
  1. I intend cudaMemcpyAsync to execute on every loop iteration to hide the copy-to/from latency. Roughly 98% of the time is spent on kernel execution.
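(One general requirement for that overlap, worth noting here: cudaMemcpyAsync can only overlap with kernel execution when the host buffer is page-locked; with ordinary pageable memory the copy may not actually run asynchronously. A sketch of the allocation, with the size purely illustrative:)

char *hostBuf;
HANDLE_ERROR( cudaMallocHost( (void **) &hostBuf, 2 * iterations * (uint64_t) BytesPerStream ) ); /* pinned host memory */
/* ... use hostBuf as the source/destination of the cudaMemcpyAsync calls ... */
HANDLE_ERROR( cudaFreeHost( hostBuf ) );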

Ok. Here are my initial thoughts.

  1. You are correct that the kernel calls are asynchronous, so the host loop will simply proceed regardless of whether the kernel has finished executing, and the runtime will make a note to execute the callback once the current iteration's kernel is finished. As you correctly stated, the problem is that the host thread may have already moved on to the next iteration and changed the pointer values before the previous iteration's callback has executed.

  2. I get the feeling that the entire algorithm should be rewritten using CUDA Dynamic Parallelism and memory fences. There needs to be a parent-child relationship and an independent, thread-based approach to the activities in each stream. Having the host thread meddle with all your operations just introduces complications and a royal headache. Everything should be done on the GPU, and you should separate stream0 from stream1 clearly in your code; a bare-bones sketch follows this list. Check out CUDA Dynamic Parallelism Tutorial with Code | Video Walkthrough (59 minutes) – Cuda Education

  3. I think you can also use cudaEventRecord to get some separation between pairs of events. cudaDeviceSynchronize is too heavy-handed and all-encompassing; cudaEventRecord lets you be more granular and pause only when you absolutely need to, for specific events. Check out https://cudaeducation.com/cudasynchronization/

  4. Do you intend to have cudaMemcpyAsync execute on each and every loop iteration? I would think you copy once and then have everything done on the GPU.
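To make point 2 a bit more concrete, here is a bare-bones dynamic-parallelism sketch (kernel names and workload made up; compile with nvcc -rdc=true for sm_35 or newer, and note that very large iteration counts may require raising cudaLimitDevRuntimePendingLaunchCount). The key property is that a parent grid does not complete until all of its child grids have completed, so the host needs only a single synchronization point instead of one per iteration:

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // placeholder work
}

__global__ void parentKernel(float *data, int n, int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        // device-side launch; launches from the same thread into the
        // same stream execute in order, so iterations stay sequential
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}

// host side: one launch, one sync, no per-iteration synchronization
// parentKernel<<<1, 1>>>(d_data, n, iterations);
// cudaDeviceSynchronize();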

Hope this helps!

-Cuda Education