Do I really need to use cudaDeviceSynchronize in this scenario?

Hi,

I have created two streams named stream0 and stream1.
The design of my module is as follows (please ignore the syntax):

loop (iterations) {

    offsetA = 2 * i * (uint64_t) BytesPerStream;
    offsetB = ((2 * i) + 1) * (uint64_t) BytesPerStream;

    cudaMemcpyAsync(..., Host2Dev, stream0);
    cudaMemcpyAsync(..., Host2Dev, stream1);

    /* POSITION 1 */

    pointerA = PointerToFile + offsetA;
    pointerB = PointerToFile + offsetB;

    kernel<<<..., stream0>>>(...);
    kernel<<<..., stream1>>>(...);

    cudaMemcpyAsync(..., Dev2Host, stream0);
    cudaMemcpyAsync(..., Dev2Host, stream1);

    cudaStreamAddCallback(stream0, ..., pointerA);
    cudaStreamAddCallback(stream1, ..., pointerB);

    cudaDeviceSynchronize(); // Why do I need this?
}

The module only works correctly when I call cudaDeviceSynchronize() inside the loop.

So I would like to get my concepts straight here.

Assume I have commented out cudaDeviceSynchronize().

  1. When stream1 has finished executing kernel<<<..., stream1>>>, it will then invoke the callback registered via cudaStreamAddCallback(stream1, …, pointerB), and the host and device can't proceed to the next iteration until that callback finishes, right?

  2. But the behavior of my program contradicts my first statement: the host proceeds to change the pointer values, i.e. pointerA and pointerB at POSITION 1, for the next iteration before cudaStreamAddCallback(stream1, …, pointerB) has finished. I want to modify pointerA and pointerB on every iteration, but I don't want them to change until both callbacks have finished.

The reason I don't want to use cudaDeviceSynchronize is that both streams read and write their own memory locations, and synchronizing the whole device on every iteration will hurt the module's performance.

Edit: I came to realize that the host can proceed to the next iteration because kernel<<<..., stream1>>> is asynchronous with respect to the host thread. So if kernel<<<..., stream1>>> is executing in iteration X, the host can move on to iteration X+1 and change the pointers before the callback from cudaStreamAddCallback(stream1, …, pointerB) is invoked.
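One workaround I'm considering, since cudaStreamAddCallback receives its userData argument by value at the time of the call: give each callback its own heap-allocated payload, so the host loop can recompute the pointers freely without racing the callbacks. A rough sketch (the struct and function names are made up, not my actual code):

#include <cuda_runtime.h>
#include <cstdlib>

struct CallbackData {
    char   *dst;    /* where this iteration's result goes */
    size_t  bytes;  /* how much data the callback should handle */
};

void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    CallbackData *cb = (CallbackData *) userData;
    /* ... consume cb->dst / cb->bytes; CUDA API calls are not allowed inside callbacks ... */
    free(cb);       /* the payload belongs to this iteration alone */
}

/* inside the loop, instead of handing the callback a shared pointer: */
CallbackData *cbB = (CallbackData *) malloc(sizeof(*cbB));
cbB->dst   = PointerToFile + offsetB;
cbB->bytes = BytesPerStream;
cudaStreamAddCallback(stream1, myCallback, cbB, 0);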

Sorry for my late reply.

  1. I haven't tried dynamic parallelism yet. The nature of my project is such that some sections contain complex operations that I thought were better suited to the host; plus, I'm already spilling device registers, so I can't afford more.

  2. I tried your idea of cudaEventRecord (instead of cudaDeviceSynchronize) and it saved me some time (thanks for that).
    My new design looks like this:

loop (iterations) {

    offsetA = 2 * i * (uint64_t) BytesPerStream;
    offsetB = ((2 * i) + 1) * (uint64_t) BytesPerStream;

    cudaMemcpyAsync(..., Host2Dev, stream0);
    cudaMemcpyAsync(..., Host2Dev, stream1);

    /* POSITION 1 */

    pointerA = PointerToFile + offsetA;
    pointerB = PointerToFile + offsetB;

    kernel<<<..., stream0>>>(...);
    kernel<<<..., stream1>>>(...);

    cudaMemcpyAsync(..., Dev2Host, stream0);
    cudaMemcpyAsync(..., Dev2Host, stream1);

    cudaStreamAddCallback(stream0, ..., pointerA);

    HANDLE_ERROR( cudaEventRecord( start, stream1 ) );
    cudaStreamAddCallback(stream1, ..., pointerB);    // recording events around this callback only
    HANDLE_ERROR( cudaEventRecord( stop, stream1 ) );

    HANDLE_ERROR( cudaEventSynchronize( stop ) );
}
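One caveat I'm aware of: cudaEventSynchronize(stop) only waits for stream1, so stream0's callback is not covered by it. If that ever becomes a problem, recording one event per stream and waiting on both would still avoid a full device-wide barrier. A sketch (event names illustrative; HANDLE_ERROR is the same macro as above; an event recorded after a callback completes only after the callback returns, since callbacks block later work in their stream):

cudaEvent_t done0, done1;
HANDLE_ERROR( cudaEventCreateWithFlags( &done0, cudaEventDisableTiming ) );
HANDLE_ERROR( cudaEventCreateWithFlags( &done1, cudaEventDisableTiming ) );

loop (iterations) {

    /* ... copies, kernels, and callbacks as above ... */

    HANDLE_ERROR( cudaEventRecord( done0, stream0 ) ); // fires only after stream0's callback
    HANDLE_ERROR( cudaEventRecord( done1, stream1 ) ); // fires only after stream1's callback

    HANDLE_ERROR( cudaEventSynchronize( done0 ) );
    HANDLE_ERROR( cudaEventSynchronize( done1 ) );
}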
  1. I intend cudaMemcpyAsync to execute on every loop iteration to hide the copy-to/from latency. Roughly 98% of the time is spent on kernel execution.
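(One general requirement for that overlap, worth noting here: cudaMemcpyAsync can only overlap with kernel execution when the host buffer is page-locked; with ordinary pageable memory the copy may not actually run asynchronously. A sketch of the allocation, with the size purely illustrative:)

char *hostBuf;
HANDLE_ERROR( cudaMallocHost( (void **) &hostBuf, 2 * iterations * (uint64_t) BytesPerStream ) ); /* pinned host memory */
/* ... use hostBuf as the source/destination of the cudaMemcpyAsync calls ... */
HANDLE_ERROR( cudaFreeHost( hostBuf ) );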

Ok. Here are my initial thoughts.

  1. You are correct that the kernel calls are asynchronous, so the host loop will simply proceed regardless of whether the kernel has finished executing, and the runtime will make a note to execute the callback once the current iteration's kernel is finished. As you correctly stated, the problem is that the host thread may have already moved on to the next iteration and changed the pointer values before the previous iteration's callback has executed.

  2. I get the feeling that the entire algorithm should be rewritten using CUDA Dynamic Parallelism and memory fences. There needs to be a parent-child relationship and an independent, thread-based approach to the activities in each stream. Having the host thread meddle with all your operations just introduces complications and a royal headache. Everything should be done on the GPU, and you should separate stream0 from stream1 clearly in your code; a bare-bones sketch follows this list. Check out CUDA Dynamic Parallelism Tutorial with Code | Video Walkthrough (59 minutes) – Cuda Education

  3. I think you can also use cudaEventRecord to get some separation between pairs of events. cudaDeviceSynchronize is too heavy-handed and all-encompassing; cudaEventRecord lets you be more granular and pause only when you absolutely need to, for specific events. Check out https://cudaeducation.com/cudasynchronization/

  4. Do you intend to have cudaMemcpyAsync execute on each and every loop iteration? I would think you copy once and then have everything done on the GPU.
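To make point 2 a bit more concrete, here is a bare-bones dynamic-parallelism sketch (kernel names and workload made up; compile with nvcc -rdc=true for sm_35 or newer, and note that very large iteration counts may require raising cudaLimitDevRuntimePendingLaunchCount). The key property is that a parent grid does not complete until all of its child grids have completed, so the host needs only a single synchronization point instead of one per iteration:

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // placeholder work
}

__global__ void parentKernel(float *data, int n, int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        // device-side launch; launches from the same thread into the
        // same stream execute in order, so iterations stay sequential
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}

// host side: one launch, one sync, no per-iteration synchronization
// parentKernel<<<1, 1>>>(d_data, n, iterations);
// cudaDeviceSynchronize();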

Hope this helps!

-Cuda Education