Synchronization means:
Work items (e.g. kernels, async copies, etc.) issued to the GPU before the synchronizing operation must all complete before the synchronizing operation is allowed to begin on the device side. It also means that all work items issued to the GPU after the synchronizing operation may not begin executing until the synchronizing operation is complete.
It tells you nothing about host thread behavior. Let me say that again: it tells you nothing about host thread behavior.
For example, a kernel launch is always asynchronous. (I prefer to use the terms blocking and non-blocking to differentiate the host description from the device description, so asynchronous, when used to describe host thread behavior, means the operation is non-blocking with respect to the host thread.) The only exceptions I know of to this "always" statement are:
- If you specify blocking launch behavior using an environment variable (e.g. CUDA_LAUNCH_BLOCKING=1).
- If the launch queue is full (meaning you have issued hundreds or thousands of kernel calls that have not begun to execute yet).
Other than those cases, kernel launches are always non-blocking to the host thread.
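As a minimal sketch of that (the spin kernel is just a hypothetical stand-in for real work): the launch returns to the host thread right away, and the host only waits when it makes an explicit synchronization call.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just occupies the GPU for roughly 'cycles' clocks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    spin<<<1, 1>>>(100000000);   // returns immediately: non-blocking to the host thread
    printf("host: launch returned; the kernel may still be running\n");

    cudaDeviceSynchronize();     // the host thread blocks here until the GPU is idle
    printf("host: kernel has completed\n");
    return 0;
}
```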
Therefore, no, it does not mean this:
Because:
- kernel launches are always non-blocking to the host thread
- The synchronization refers to the progress of work on the GPU, not the progress of the host thread
The nature of asynchronous work issuance to the GPU is that ordinarily, and with proper API choices, the CPU thread can “race ahead” of what is actually taking place on the GPU. The issued work goes into a queue, and gets dispatched to the GPU when stream semantics and GPU capacity allow.
Going back to a synchronizing operation, then: it means that the CPU thread put a bunch of work into the queue. The issuance of work from the queue to the GPU will be ordered, with respect to a synchronizing op in the queue, such that all previous work in the queue must complete before the synchronizing op is allowed to begin executing, and all subsequent work in the queue will not be allowed to execute on the GPU until the synchronizing op has completed.
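Here is a minimal sketch of that "race ahead" behavior (the spin kernel is a hypothetical placeholder): the host thread enqueues a batch of launches and immediately moves on; it only blocks at the explicit cudaStreamSynchronize() call, by which point the GPU may still be draining the queue.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that occupies the GPU for roughly 'cycles' clocks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s;
    cudaStreamCreate(&s);

    for (int i = 0; i < 10; ++i)
        spin<<<1, 1, 0, s>>>(10000000);   // 10 launches go into the queue back-to-back

    printf("host: 10 kernels queued; this prints while the GPU is still working\n");

    cudaStreamSynchronize(s);             // host blocks here until the queued work drains
    printf("host: the stream is now empty\n");

    cudaStreamDestroy(s);
    return 0;
}
```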
You can get more information on these topics via this online training as well as this DLI course.
It’s not well-defined. It definitely includes any API call that takes a stream parameter (when the supplied or defaulted stream is the NULL stream), and there are other commands that fall into this category, like cudaMalloc() (although that is explicitly covered previously in the list: “a device memory allocation”). But, for example, cudaSetDevice() is neither blocking nor synchronizing.
It means cudaMemset. cudaMemcpy always runs in the null stream, and cudaMemcpyAsync runs in whatever stream you specify. cudaMemcpy is blocking and synchronizing. cudaMemcpyAsync can have a variety of different blocking or synchronizing behaviors, depending on the specifics of the call.
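As a rough sketch of the contrast (buffer names are hypothetical; the async case assumes pinned host memory and a non-default stream, where cudaMemcpyAsync is typically non-blocking to the host thread):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;
    float *h_pinned, *d_buf;
    cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault); // pinned host buffer
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Blocking and synchronizing: runs in the null stream, and the host thread
    // does not proceed until the copy is complete.
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    // Typically non-blocking here: the copy is queued into stream s and the host
    // thread continues; synchronize before touching the destination data.
    cudaMemcpyAsync(h_pinned, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);   // host waits here for the async copy to finish

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```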
Let’s go back to this statement:
It tells you nothing about host thread behavior.
Nothing means:
- it does not tell you if the operation will be blocking
- it does not tell you if the operation will be non-blocking.
For example, kernel launches, even issued into the null stream, are non-blocking. But they are synchronizing (if issued into the null stream).
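A small sketch of that combination, assuming legacy default-stream semantics and streams created without the cudaStreamNonBlocking flag (the spin kernel is hypothetical): all three launches return to the host immediately (non-blocking), but the null-stream launch imposes device-side ordering around itself (synchronizing).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that occupies the GPU for roughly 'cycles' clocks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    spin<<<1, 1, 0, s1>>>(10000000);  // A: queued in s1
    spin<<<1, 1>>>(10000000);         // null stream: on the device, waits for A...
    spin<<<1, 1, 0, s2>>>(10000000);  // ...and B in s2 cannot start until it finishes

    // All three launches were non-blocking: the host reaches this line while the
    // GPU is still working through the ordered queue A -> null-stream kernel -> B.
    printf("host: launches returned immediately\n");

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```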
AFAIK, cudaMalloc() is blocking (and synchronizing).