Question about memory flush and synchronization

In the CUDA model, global synchronization among blocks is not allowed, so one way to get the effect of a global synchronization is to end the kernel and launch a new one. My question is this: when a kernel returns, is the data in read/write buffers guaranteed to be "flushed"?
Suppose there are two kernels, kernel1 and kernel2. kernel1 writes so much global data that the write buffers become full. If kernel1 returns and kernel2 is launched, will kernel2 see all the latest data written by kernel1?

I think you’re asking if the global memory writes of kernel 1 will be flushed and guaranteed to be visible to kernel 2.

No, they aren’t… but that’s completely under your control. On the host side, if you want to guarantee that the kernels run sequentially, you’d launch kernel 1, then do a host-side cudaThreadSynchronize(), then launch kernel 2. That gives you the guarantee that kernel 2 sees all the device memory effects of kernel 1. If you didn’t need to sync the kernels, you could just launch kernels 1 and 2 and the device would run them asynchronously.
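A minimal sketch of that host-side pattern (kernel names, signatures, and launch dimensions here are placeholders, not from the original question):

```cuda
// Hypothetical kernels for illustration only.
__global__ void kernel1(float *data) { /* writes to global memory */ }
__global__ void kernel2(float *data) { /* reads what kernel1 wrote */ }

int main()
{
    const int N = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    kernel1<<<N / 256, 256>>>(d_data);
    cudaThreadSynchronize();          // host blocks until kernel1 has fully completed
    kernel2<<<N / 256, 256>>>(d_data); // safe: all of kernel1's writes are visible

    cudaFree(d_data);
    return 0;
}
```

(Note that as the follow-up posts below explain, the explicit cudaThreadSynchronize() turns out to be unnecessary when both launches use the same stream.)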

Most of the time, the async calls are about moving memory to and from the device between kernel invocations, but it sounds like in your case you want to make sure the kernels execute disjointly, which is actually even simpler.

For even finer control of compute and memory orders, you can set up streams… look at the programming guide for examples.
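As a sketch of what a non-default stream looks like (the kernel names and launch dimensions are assumptions carried over from the example above): work issued into the same stream is executed in issue order, so the second launch cannot begin until the first finishes.

```cuda
cudaStream_t s;
cudaStreamCreate(&s);

// Fourth launch-configuration argument selects the stream.
kernel1<<<N / 256, 256, 0, s>>>(d_data); // runs first
kernel2<<<N / 256, 256, 0, s>>>(d_data); // same stream: begins only after kernel1 completes

cudaStreamSynchronize(s);  // host waits for everything queued in s
cudaStreamDestroy(s);
```

Operations in *different* streams, by contrast, may overlap; that is where the finer control comes in.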

Is this really true? From the Programming Guide (4.5.1.5):

"Any kernel launch, memory set, or memory copy function without a stream parameter or with a zero stream parameter begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done."

I have interpreted this to mean that writes to global memory are completed as part of the kernel launch completing.

Whoops, my mistake. I keep thinking of the common case where kernel launches are asynchronous… they are, but only with respect to the host. lee222 is worried about what the launches look like to the device. Both kernels run in the implied stream #0 and are therefore executed sequentially.

Thanks for catching my error, seibert!

So lee222, the corrected answer is: if you use the same stream for both kernel launches, you’re guaranteed no execution overlap, and therefore all the memory writes from kernel #1 will indeed be seen by kernel #2.

You don’t need the cudaThreadSynchronize().
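So the earlier sketch simplifies to back-to-back launches on the default stream (again, kernel names and dimensions are placeholders):

```cuda
kernel1<<<N / 256, 256>>>(d_data); // implied stream #0
kernel2<<<N / 256, 256>>>(d_data); // same stream: begins only after kernel1 completes
// no cudaThreadSynchronize() needed between the two launches;
// kernel2 is guaranteed to see all of kernel1's global memory writes
```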

Thanks.

Then can I interpret the last sentence, "You don’t need the cudaThreadSynchronize()", as follows?

"If I launch kernels with a zero stream parameter, the global memory flush is guaranteed, and thus I don’t need cudaThreadSynchronize()."

That is correct. Operations within streams are serialized.

Thanks.