CUDA implicit synchronization behavior and conditions in detail

Hi,

The CUDA Programming Guide lists the following as actions that cause implicit synchronization:

  • a page-locked host memory allocation,
  • a device memory allocation,
  • a device memory set,
  • a memory copy between two addresses to the same device memory,
  • any CUDA command to the NULL stream,
  • a switch between the L1/shared memory configurations
  1. What exactly is meant by “Synchronization” in this case? Let’s say you have 2 user-created streams, s1 and s2, and the following is performed:
    • a kernel call to s1
    • a kernel call to the NULL stream (i.e. no stream argument)
    • a kernel call to s2
    Does execution on the host thread pause on the second kernel call until the first is finished, or is it somehow allowed to continue (with the second call not executed until the first one is finished, and the third not until the second)? I would have thought the former is true, but I did some testing and it seems to be the latter. Furthermore, the answer here
    https://forums.developer.nvidia.com/t/implicit-synchronization/61303
    seems to imply it’s the second option: “It doesn’t necessarily mean that they become synchronous with the host”, but I might be misunderstanding something.

  2. What falls under the definition of “any CUDA command to the NULL stream”? Is it any API call that takes a stream argument or is it any CUDA API call? I guess it’s the former, because otherwise “a device memory allocation” wouldn’t be listed.

  3. Does “device memory set” mean Memset or Memcpy functions?

Synchronization means:

Work items (e.g. kernels, async copies, etc.) issued to the GPU before the synchronizing operation must all complete before the synchronizing operation is allowed to begin on the device side. It also means that all work items issued to the GPU after the synchronizing operation may not begin executing until the synchronizing operation is complete.
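To make that concrete, here is a minimal sketch of the scenario from question 1 (the kernel bodies and launch configurations are made up). All three launches return to the host almost immediately; the ordering described above applies to when the kernels are allowed to execute on the device.

```
#include <cuda_runtime.h>

__global__ void k1() { /* ... */ }
__global__ void k2() { /* ... */ }
__global__ void k3() { /* ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    k1<<<1, 1, 0, s1>>>();   // issued to s1
    k2<<<1, 1>>>();          // issued to the NULL stream (synchronizing)
    k3<<<1, 1, 0, s2>>>();   // issued to s2

    // Device-side ordering implied by the NULL-stream launch:
    //   k2 cannot begin until k1 has completed,
    //   k3 cannot begin until k2 has completed.
    // None of this blocks the host thread; it races ahead to here.

    cudaDeviceSynchronize(); // this call *is* blocking: wait for all three kernels
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```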

It tells you nothing about host thread behavior. Let me say that again: it tells you nothing about host thread behavior.

For example, a kernel launch is always asynchronous. (I prefer to use the terms blocking and non-blocking to differentiate the host description from the device description, so asynchronous, when used to describe host thread behavior, means the operation is non-blocking with respect to the host thread.) The only exceptions I know of to this always statement are:

  • If you specify blocking behavior using an environment variable (CUDA_LAUNCH_BLOCKING).
  • If the launch queue is full (meaning you have issued hundreds or thousands of kernel calls that have not begun to execute yet).

Other than those cases, kernel launches are always non-blocking to the host thread.
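A quick way to observe this (a rough sketch; the kernel body and iteration count are arbitrary) is to time the launch call itself against a following cudaDeviceSynchronize():

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(int iters) {
    // spin to make the kernel take a noticeable amount of time
    volatile int x = 0;
    for (int i = 0; i < iters; ++i) x += i;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<1, 1>>>(100000000);   // returns to the host almost immediately
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();            // blocks until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    printf("launch: %lld us, sync: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}
```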

Therefore, no, it does not mean that the host thread pauses on the second kernel call until the first is finished.

Because:

  1. kernel launches are always non-blocking to the host thread
  2. The synchronization refers to the progress of work on the GPU, not the progress of the host thread

The nature of asynchronous work issuance to the GPU is that ordinarily, and with proper API choices, the CPU thread can “race ahead” of what is actually taking place on the GPU. The issued work goes into a queue, and gets dispatched to the GPU when stream semantics and GPU capacity allow.

Going back to a synchronizing operation, then: it means that the CPU thread has put a bunch of work into the queue. The issuance of work from the queue to the GPU is ordered, with respect to a synchronizing op in the queue, such that all previous work in the queue must complete before the synchronizing op is allowed to begin executing, and all subsequent work in the queue is not allowed to execute on the GPU until the synchronizing op has completed.
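That queueing behavior is what makes the usual overlap pattern work. A minimal sketch (do_cpu_work() and the kernel are placeholders):

```
#include <cuda_runtime.h>

__global__ void gpu_work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

void do_cpu_work() { /* placeholder for unrelated host work */ }

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    gpu_work<<<(n + 255) / 256, 256, 0, s>>>(d, n); // queued; the host does not wait
    do_cpu_work();                                  // overlaps with GPU execution
    cudaStreamSynchronize(s);                       // block only when results are needed

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```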

You can get more information on these topics via this online training as well as this DLI course.

Regarding your second question (“any CUDA command to the NULL stream”): it’s not well-defined. It definitely includes any API call that takes a stream parameter (when the supplied or defaulted stream is the NULL stream), and there are other commands that fall into this category, like cudaMalloc() (although that is already covered explicitly in the list: “a device memory allocation”). But, for example, cudaSetDevice() is neither blocking nor synchronizing.
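As a rough illustration of that distinction (buffer size and stream name are arbitrary):

```
#include <cuda_runtime.h>

int main() {
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));   // itself a synchronizing (and blocking) call

    cudaStream_t s1;
    cudaStreamCreate(&s1);

    cudaMemsetAsync(d, 0, 1024 * sizeof(float));      // stream arg omitted -> NULL stream -> synchronizing
    cudaMemsetAsync(d, 0, 1024 * sizeof(float), s1);  // issued to s1 -> not a NULL-stream command

    cudaSetDevice(0);    // takes no stream; neither blocking nor synchronizing

    cudaStreamDestroy(s1);
    cudaFree(d);
    return 0;
}
```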

Regarding your third question: it means cudaMemset. cudaMemcpy always runs in the NULL stream, and cudaMemcpyAsync runs in whatever stream you specify. cudaMemcpy is blocking and synchronizing. cudaMemcpyAsync can have a variety of different blocking or synchronizing behaviors, depending on the specifics of the call.
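A small sketch contrasting the two copy calls (sizes are arbitrary; pinned host memory is assumed so that cudaMemcpyAsync can actually be non-blocking):

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h_pinned, *d;
    cudaMallocHost(&h_pinned, bytes);   // page-locked host allocation (itself synchronizing)
    cudaMalloc(&d, bytes);

    cudaStream_t s1;
    cudaStreamCreate(&s1);

    // Blocking and synchronizing: runs in the NULL stream, the host waits for completion.
    cudaMemcpy(d, h_pinned, bytes, cudaMemcpyHostToDevice);

    // Non-blocking to the host (pinned memory, user stream); ordered only
    // with respect to other work in stream s1 and NULL-stream operations.
    cudaMemcpyAsync(d, h_pinned, bytes, cudaMemcpyHostToDevice, s1);

    cudaStreamSynchronize(s1);          // blocking: wait for the async copy

    cudaStreamDestroy(s1);
    cudaFree(d);
    cudaFreeHost(h_pinned);
    return 0;
}
```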

Let’s go back to this statement:

It tells you nothing about host thread behavior.

Nothing means:

  • it does not tell you if the operation will be blocking
  • it does not tell you if the operation will be non-blocking.

For example, kernel launches, even issued into the null stream, are non-blocking. But they are synchronizing (if issued into the null stream).

AFAIK, cudaMalloc() is blocking (and synchronizing).


Thank you for the detailed explanation!
