CUDA implicit synchronization behavior and conditions in detail

Hi,

The CUDA Programming Guide lists the following as actions that cause implicit synchronization:

  • a page-locked host memory allocation,
  • a device memory allocation,
  • a device memory set,
  • a memory copy between two addresses to the same device memory,
  • any CUDA command to the NULL stream,
  • a switch between the L1/shared memory configurations
  1. What exactly is meant by “Synchronization” in this case? Let’s say you have 2 user-created streams, s1 and s2, and the following is performed:
    • a kernel call to s1
    • a kernel call to the NULL stream (i.e. no stream argument)
    • a kernel call to s2
    Does execution on the host thread pause on the second kernel call until the first is finished, or is it somehow allowed to continue (with the second call not executed until the first one is finished, and the third not until the second)? I would have thought the former is true, but I did some testing and it seems to be the latter. Furthermore, the answer here
    https://forums.developer.nvidia.com/t/implicit-synchronization/61303
    seems to imply it’s the second option: “It doesn’t necessarily mean that they become synchronous with the host”, but I might be misunderstanding something.

  2. What falls under the definition of “any CUDA command to the NULL stream”? Is it any API call that takes a stream argument or is it any CUDA API call? I guess it’s the former, because otherwise “a device memory allocation” wouldn’t be listed.

  3. Does “device memory set” mean Memset or Memcpy functions?

Synchronization means:

Work items (e.g. kernels, async copies, etc.) issued to the GPU before the synchronizing operation must all complete before the synchronizing operation is allowed to begin on the device side. It also means that all work items issued to the GPU after the synchronizing operation may not begin executing until the synchronizing operation is complete.
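To make that concrete, here is a minimal sketch of the scenario from question 1 (the kernel bodies and launch configurations are made up). All three launches return to the host almost immediately; the ordering described above applies to when the kernels are allowed to execute on the device.

```
#include <cuda_runtime.h>

__global__ void k1() { /* ... */ }
__global__ void k2() { /* ... */ }
__global__ void k3() { /* ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    k1<<<1, 1, 0, s1>>>();   // issued to s1
    k2<<<1, 1>>>();          // issued to the NULL stream (synchronizing)
    k3<<<1, 1, 0, s2>>>();   // issued to s2

    // Device-side ordering implied by the NULL-stream launch:
    //   k2 cannot begin until k1 has completed,
    //   k3 cannot begin until k2 has completed.
    // None of this blocks the host thread; it races ahead to here.

    cudaDeviceSynchronize(); // this call *is* blocking: wait for all three kernels
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```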

It tells you nothing about host thread behavior. Let me say that again: it tells you nothing about host thread behavior.

For example, a kernel launch is always asynchronous. (I prefer to use the terms blocking and non-blocking to differentiate the host description from the device description, so asynchronous, when used to describe host thread behavior, means the operation is non-blocking with respect to the host thread.) The only exceptions I know of to this always statement are:

  • If you specify blocking behavior using an environment variable (CUDA_LAUNCH_BLOCKING).
  • If the launch queue is full (meaning you have issued hundreds or thousands of kernel calls that have not begun to execute yet).

Other than those cases, kernel launches are always non-blocking to the host thread.
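A quick way to observe this (a rough sketch; the kernel body and iteration count are arbitrary) is to time the launch call itself against a following cudaDeviceSynchronize():

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(int iters) {
    // spin to make the kernel take a noticeable amount of time
    volatile int x = 0;
    for (int i = 0; i < iters; ++i) x += i;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<1, 1>>>(100000000);   // returns to the host almost immediately
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();            // blocks until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    printf("launch: %lld us, sync: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}
```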

Therefore, no, it does not mean that the host thread pauses on the second kernel call until the first is finished.

Because:

  1. kernel launches are always non-blocking to the host thread
  2. The synchronization refers to the progress of work on the GPU, not the progress of the host thread

The nature of asynchronous work issuance to the GPU is that ordinarily, and with proper API choices, the CPU thread can “race ahead” of what is actually taking place on the GPU. The issued work goes into a queue, and gets dispatched to the GPU when stream semantics and GPU capacity allow.

Going back to a synchronizing operation, then: it means that the CPU thread has put a bunch of work into the queue. The issuance of work from the queue to the GPU is ordered, with respect to a synchronizing op in the queue, such that all previous work in the queue must complete before the synchronizing op is allowed to begin executing, and all subsequent work in the queue is not allowed to execute on the GPU until the synchronizing op has completed.
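That queueing behavior is what makes the usual overlap pattern work. A minimal sketch (do_cpu_work() and the kernel are placeholders):

```
#include <cuda_runtime.h>

__global__ void gpu_work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

void do_cpu_work() { /* placeholder for unrelated host work */ }

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    gpu_work<<<(n + 255) / 256, 256, 0, s>>>(d, n); // queued; the host does not wait
    do_cpu_work();                                  // overlaps with GPU execution
    cudaStreamSynchronize(s);                       // block only when results are needed

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```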

You can get more information on these topics via this online training as well as this DLI course.

Regarding your second question (“any CUDA command to the NULL stream”): it’s not well-defined. It definitely includes any API call that takes a stream parameter (when the supplied or defaulted stream is the NULL stream), and there are other commands that fall into this category, like cudaMalloc() (although that is already covered explicitly in the list: “a device memory allocation”). But, for example, cudaSetDevice() is neither blocking nor synchronizing.
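As a rough illustration of that distinction (buffer size and stream name are arbitrary):

```
#include <cuda_runtime.h>

int main() {
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));   // itself a synchronizing (and blocking) call

    cudaStream_t s1;
    cudaStreamCreate(&s1);

    cudaMemsetAsync(d, 0, 1024 * sizeof(float));      // stream arg omitted -> NULL stream -> synchronizing
    cudaMemsetAsync(d, 0, 1024 * sizeof(float), s1);  // issued to s1 -> not a NULL-stream command

    cudaSetDevice(0);    // takes no stream; neither blocking nor synchronizing

    cudaStreamDestroy(s1);
    cudaFree(d);
    return 0;
}
```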

Regarding your third question: it means cudaMemset. cudaMemcpy always runs in the NULL stream, and cudaMemcpyAsync runs in whatever stream you specify. cudaMemcpy is blocking and synchronizing. cudaMemcpyAsync can have a variety of different blocking or synchronizing behaviors, depending on the specifics of the call.
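A small sketch contrasting the two copy calls (sizes are arbitrary; pinned host memory is assumed so that cudaMemcpyAsync can actually be non-blocking):

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h_pinned, *d;
    cudaMallocHost(&h_pinned, bytes);   // page-locked host allocation (itself synchronizing)
    cudaMalloc(&d, bytes);

    cudaStream_t s1;
    cudaStreamCreate(&s1);

    // Blocking and synchronizing: runs in the NULL stream, the host waits for completion.
    cudaMemcpy(d, h_pinned, bytes, cudaMemcpyHostToDevice);

    // Non-blocking to the host (pinned memory, user stream); ordered only
    // with respect to other work in stream s1 and NULL-stream operations.
    cudaMemcpyAsync(d, h_pinned, bytes, cudaMemcpyHostToDevice, s1);

    cudaStreamSynchronize(s1);          // blocking: wait for the async copy

    cudaStreamDestroy(s1);
    cudaFree(d);
    cudaFreeHost(h_pinned);
    return 0;
}
```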

Let’s go back to this statement:

It tells you nothing about host thread behavior.

Nothing means:

  • it does not tell you if the operation will be blocking
  • it does not tell you if the operation will be non-blocking.

For example, kernel launches, even issued into the null stream, are non-blocking. But they are synchronizing (if issued into the null stream).

AFAIK, cudaMalloc() is blocking (and synchronizing).


Thank you for the detailed explanation!
