Is cudaMemset actually "asynchronous"?


I just read in the CUDA API docs that “The synchronous memset functions are asynchronous with respect to the host except when the target is pinned host memory or a Unified Memory region”

Here’s the link:

I was shocked. Does it mean that we actually need to do a cudaDeviceSynchronize() after a cudaMemset() call?


No, it means they may be asynchronous with respect to the host: control may return to the host thread before the operation is complete, just like a kernel launch. Since cudaMemset (normally **) has no bearing on any host data, this should not matter.

All CUDA calls issued to a particular stream will be executed in order, with respect to other CUDA activity issued to the same stream.

Therefore if you do a cudaMemset, followed by a kernel call, both in the same stream (or both to the default stream) you can be assured that all of the results of the cudaMemset operation will be visible by any kernel activity.
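A minimal sketch of that same-stream ordering guarantee (the kernel and variable names here are illustrative, not from the original post):

```cuda
// Sketch: cudaMemset and a subsequent kernel, both issued to the default
// stream. Stream ordering guarantees the kernel sees the zeroed buffer;
// no cudaDeviceSynchronize() is needed in between.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    cudaMemset(d_data, 0, n * sizeof(int));      // may return before completion...
    addOne<<<(n + 255) / 256, 256>>>(d_data, n); // ...but runs after the memset finishes

    cudaFree(d_data);
    return 0;
}
```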

** If the target of the cudaMemset operation is either pinned host memory or a Unified Memory region, then the affected data is visible to host code. In that situation the stated asynchronous behavior does not apply, in order to preserve sensible program semantics: the cudaMemset call does not return until the memset operation is complete, so subsequent host code is able to “see” its effect. It is, in effect, blocking with respect to the host thread.

No cudaDeviceSynchronize() should be needed in any of the above cases, with respect to the cudaMemset operation. (Use of unified memory may require a cudaDeviceSynchronize() after kernel execution, so that unified data is again “visible” to the host, but that aside has no bearing on the stated behavior of the cudaMemset operation.)
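To illustrate the pinned-host-memory case (a sketch; the variable names are mine):

```cuda
// Sketch: when the target is pinned host memory, cudaMemset blocks until
// the memset is complete, so the host can read the result immediately.
#include <cassert>

int *h_pinned;
cudaMallocHost(&h_pinned, 256 * sizeof(int)); // pinned (page-locked) host allocation

cudaMemset(h_pinned, 0, 256 * sizeof(int));   // blocking in this case

// No cudaDeviceSynchronize() needed: the data is already host-visible.
assert(h_pinned[0] == 0);

cudaFreeHost(h_pinned);
```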

Yes, but my program actually has multiple threads, each launching kernels in separate streams. Suppose thread-A calls cudaMemset() and then signals thread-B to run, and thread-B launches a kernel (on the same data) in a non-default stream, stream-B. There, the asynchronous behavior of cudaMemset() would cause a problem.

I think I would change my cudaMemset() to cudaMemsetAsync() and let it run in a non-default stream (maybe stream-A), and then explicitly synchronize stream-A.

An alternative would be to run cudaMemsetAsync() in stream-B itself, since stream ordering guarantees it finishes before any subsequently launched kernels in that stream.

Either way, I think cudaMemsetAsync() should be used instead. Am I correct?
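The two options above could look roughly like this (a sketch; `d_data`, `bytes`, `streamA`, `streamB`, and `myKernel` are illustrative names, not from the original program):

```cuda
// Option 1: memset in stream-A, then synchronize stream-A before
// signaling thread-B.
cudaMemsetAsync(d_data, 0, bytes, streamA);
cudaStreamSynchronize(streamA);  // memset complete before thread-B is signaled

// Option 2: memset in stream-B itself; stream ordering guarantees the
// kernel launched later in stream-B runs after the memset.
cudaMemsetAsync(d_data, 0, bytes, streamB);
myKernel<<<grid, block, 0, streamB>>>(d_data);
```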


First of all, you shouldn’t really ever expect synchronous behavior between CUDA operations issued to different streams. That is not good CUDA programming practice, IMO. The whole point of separate streams is to de-synchronize activity.

I’m not sure that is the case.

Your non-async cudaMemset operation is issued to the default stream. Issuing an operation to the default stream has a device-wide synchronizing effect:

“any CUDA command to the NULL stream,”

assuming you have not modified the behavior of the default stream.
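For reference, one way that default-stream behavior can be modified is by creating non-blocking streams (a sketch):

```cuda
// Streams created with cudaStreamNonBlocking do NOT synchronize with the
// legacy default (NULL) stream. Similarly, compiling with
// nvcc --default-stream per-thread changes the default-stream semantics.
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
```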

But considering all these gyrations is not the right way to go in my opinion. It’s not a sensible way to use the available APIs.

For a multi-streamed application, I would always use the async APIs. Trying to figure out the impact of the non-async API call on the behavior of a streamed application is not worth it.
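For cross-stream (and cross-thread) dependencies, CUDA events are the idiomatic tool. A sketch of the scenario described earlier, assuming an event `memsetDone` created in advance with cudaEventCreate, and illustrative names throughout:

```cuda
// Thread A: issue the memset to stream-A and record an event after it.
cudaMemsetAsync(d_data, 0, bytes, streamA);
cudaEventRecord(memsetDone, streamA);
// ...signal thread B...

// Thread B: make stream-B wait on the event before launching the kernel.
// This orders the kernel after the memset without blocking the host.
cudaStreamWaitEvent(streamB, memsetDone, 0);
myKernel<<<grid, block, 0, streamB>>>(d_data);
```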

Based on my understanding, cudaMemset() synchronizes all streams at the time it is issued, but control may return to the host before it has finished. In that sense it is both “synchronous” (synchronizes all streams when issued) and “asynchronous” (returns before it has finished). What a weird design!

Totally agree. I have this cudaMemset() call because I was modifying my program from a single-stream one to a multi-stream one, and somehow I forgot to change this function. I hope changing it to cudaMemsetAsync() will fix my bug in the other thread.

Thank you!


The other aspect you may want to examine is whether a memset() operation is actually required. In my experience there is rarely a need for such bulk initialization, whether on the host or the device.