Question about memory flush and synchronization

In the CUDA model, global synchronization among blocks is not allowed, so one way to get the effect of a global synchronization is to end the kernel and launch a new one. My question is this: when a kernel returns, is the data in read/write buffers guaranteed to be "flushed"?
Suppose there are two kernels, kernel1 and kernel2. kernel1 writes so much global data that the write buffers become full. If kernel1 returns and kernel2 is launched, will kernel2 see all the latest data written by kernel1?

I think you’re asking if the global memory writes of kernel 1 will be flushed and guaranteed to be visible to kernel 2.

No, they aren’t… but that’s completely under your control. On the host side, if you want to guarantee that the kernels run sequentially, you’d launch kernel 1, then do a host-side cudaThreadSynchronize(), then launch kernel 2. That gives you the guarantee that kernel 2 sees all the device memory effects of kernel 1. If you didn’t need to sync the kernels, you could just launch kernels 1 and 2 and the device would run them asynchronously.
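A minimal sketch of that host-side pattern (kernel names, signatures, and launch dimensions here are placeholders, not from the original question):

```cuda
// Hypothetical kernels for illustration only.
__global__ void kernel1(float *data) { /* writes to global memory */ }
__global__ void kernel2(float *data) { /* reads what kernel1 wrote */ }

int main()
{
    const int N = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    kernel1<<<N / 256, 256>>>(d_data);
    cudaThreadSynchronize();          // host blocks until kernel1 has fully completed
    kernel2<<<N / 256, 256>>>(d_data); // safe: all of kernel1's writes are visible

    cudaFree(d_data);
    return 0;
}
```

(Note that as the follow-up posts below explain, the explicit cudaThreadSynchronize() turns out to be unnecessary when both launches use the same stream.)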

Most of the time, the async calls are about moving memory to and from the device between kernel invocations, but it sounds like in your case you want to make sure the kernels execute disjointly, which is actually even simpler.

For even finer control of compute and memory orders, you can set up streams… look at the programming guide for examples.
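As a sketch of what a non-default stream looks like (the kernel names and launch dimensions are assumptions carried over from the example above): work issued into the same stream is executed in issue order, so the second launch cannot begin until the first finishes.

```cuda
cudaStream_t s;
cudaStreamCreate(&s);

// Fourth launch-configuration argument selects the stream.
kernel1<<<N / 256, 256, 0, s>>>(d_data); // runs first
kernel2<<<N / 256, 256, 0, s>>>(d_data); // same stream: begins only after kernel1 completes

cudaStreamSynchronize(s);  // host waits for everything queued in s
cudaStreamDestroy(s);
```

Operations in *different* streams, by contrast, may overlap; that is where the finer control comes in.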

Is this really true? From the Programming Guide (4.5.1.5):

"Any kernel launch, memory set, or memory copy function without a stream parameter or with a zero stream parameter begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done."

I have interpreted this to mean that writes to global memory are completed as part of the kernel launch completing.

Whoops, my mistake. I keep thinking of the common case where kernel launches are asynchronous… they are, but only with respect to the host. lee222 is worried about what the launches look like to the device. Both kernels run in the implied stream #0 and are therefore executed sequentially.

Thanks for catching my error, seibert!

So lee222, the corrected answer is: if you use the same stream for both kernel launches, you’re guaranteed no execution overlap, and therefore all the memory writes from kernel #1 will indeed be seen by kernel #2.

You don’t need the cudaThreadSynchronize().
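So the earlier sketch simplifies to back-to-back launches on the default stream (again, kernel names and dimensions are placeholders):

```cuda
kernel1<<<N / 256, 256>>>(d_data); // implied stream #0
kernel2<<<N / 256, 256>>>(d_data); // same stream: begins only after kernel1 completes
// no cudaThreadSynchronize() needed between the two launches;
// kernel2 is guaranteed to see all of kernel1's global memory writes
```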

Thanks.

Then can I interpret the last sentence, "You don’t need the cudaThreadSynchronize()", as follows?

"If I launch kernels with a zero stream parameter, the global memory flush is guaranteed, and thus I don’t need cudaThreadSynchronize()."

That is correct. Operations within streams are serialized.

Thanks.