Is cudaMemset actually "asynchronous"?

Cui · January 5, 2016, 5:29pm

Hi,

I just read from CUDA API doc that “The synchronous memset functions are asynchronous with respect to the host except when the target is pinned host memory or a Unified Memory region”

Here’s the link:
http://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memset

I was shocked. Does it mean that we actually need to do a cudaDeviceSynchronize() after a cudaMemset() call?

Thanks,
Cui

Robert_Crovella · January 5, 2016, 5:55pm

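No, it means they may be asynchronous with respect to the host. In other words, the call may behave like a kernel launch: control can return to the host thread before the operation has completed on the device. Since cudaMemset (normally **) has no bearing on any host data, this should not matter.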

All CUDA calls issued to a particular stream will be executed in order, with respect to other CUDA activity issued to the same stream.

Therefore if you do a cudaMemset, followed by a kernel call, both in the same stream (or both to the default stream) you can be assured that all of the results of the cudaMemset operation will be visible by any kernel activity.
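As a minimal sketch of that ordering guarantee (the kernel `add_one` and the function `memset_then_kernel` are hypothetical names made up for illustration), the following issues a cudaMemset and a kernel to the default stream with no explicit synchronization in between, then checks the result:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if every element is 1 afterwards
// (0 written by the memset, plus 1 added by the kernel).
bool memset_then_kernel(int n) {
    int *d_data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    // Both operations go to the default stream, so they execute in issue
    // order. No cudaDeviceSynchronize() is placed between them.
    cudaMemset(d_data, 0, bytes);
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);

    // The blocking copy is also issued to the default stream, so it is
    // ordered after the kernel and returns once the data is on the host.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), d_data, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return false;

    for (int v : host) {
        if (v != 1) return false;
    }
    return true;
}
```

Calling `memset_then_kernel(1 << 20)` from `main` should return true if the in-stream ordering holds as described.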

** If the target of the cudaMemset operation is either pinned host memory or a unified memory region, then the target is visible to host code. In that situation, the stated asynchronous behavior does not apply with respect to the host, to preserve sensible program semantics. In those cases, the cudaMemset operation should not return until the memset operation is complete, because the affected data is host-visible. Therefore subsequent host code should be able to “see” the effect of the cudaMemset operation, as it is not asynchronous in that case. It is in effect blocking with respect to the host thread.
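A small sketch of the host-visible case, assuming a device that supports managed memory (the function name `memset_managed_visible_to_host` is hypothetical). It relies on the documented behavior quoted at the top of the thread, so treat it as an illustration of that statement rather than a proof of it:

```cuda
#include <cuda_runtime.h>

// Returns true if the host sees the zeros written by cudaMemset
// without any explicit synchronization call.
bool memset_managed_visible_to_host(size_t n) {
    unsigned char *buf = nullptr;
    if (cudaMallocManaged(&buf, n) != cudaSuccess) return false;

    // Fill from the host with a nonzero pattern so the check is meaningful.
    for (size_t i = 0; i < n; ++i) buf[i] = 0xFF;

    // Target is a unified (managed) memory region. Per the documented
    // behavior, this call should not return until the memset is complete.
    if (cudaMemset(buf, 0, n) != cudaSuccess) {
        cudaFree(buf);
        return false;
    }

    // Deliberately no cudaDeviceSynchronize() before the host reads.
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] != 0) { ok = false; break; }
    }
    cudaFree(buf);
    return ok;
}
```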

No cudaDeviceSynchronize() should be needed in any of the above cases with respect to the cudaMemset operation. (Use of unified memory may require a cudaDeviceSynchronize() after kernel execution, so that unified data is again “visible” to the host, but this aside has no bearing on the stated behavior of the cudaMemset operation.)

Cui · January 5, 2016, 6:27pm

Yes, but actually my program has multiple threads, each launching kernels in separate streams. Suppose thread-A calls cudaMemset() and signals thread-B to run, and then thread-B launches a kernel (on the same data) in a non-default stream called stream-B. The asynchronous behavior of cudaMemset() would cause a problem.

I think I would change my cudaMemset() to cudaMemsetAsync() and let it run in a non-default stream (maybe stream-A), and then explicitly synchronize stream-A.

An alternative would be to issue cudaMemsetAsync() in stream-B, which guarantees that it finishes before any kernels launched afterwards in that stream.
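The two options above could be sketched roughly as follows. This runs everything from a single host thread for brevity; `add_one`, `all_ones`, and `memset_across_streams` are hypothetical names, and `streamA`/`streamB` stand in for stream-A and stream-B:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Copies the buffer back and checks that every element equals 1.
static bool all_ones(const int *d_data, int n) {
    std::vector<int> host(n);
    if (cudaMemcpy(host.data(), d_data, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return false;
    for (int v : host) {
        if (v != 1) return false;
    }
    return true;
}

// same_stream == false: memset in stream A, host waits on A, kernel in B.
// same_stream == true:  memset and kernel are both issued to stream B.
bool memset_across_streams(int n, bool same_stream) {
    int *d_data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    if (same_stream) {
        // In-order execution within stream B covers the dependency.
        cudaMemsetAsync(d_data, 0, bytes, streamB);
    } else {
        // Explicitly wait for stream A before anything is issued to B.
        cudaMemsetAsync(d_data, 0, bytes, streamA);
        cudaStreamSynchronize(streamA);
    }
    add_one<<<(n + 255) / 256, 256, 0, streamB>>>(d_data, n);
    cudaStreamSynchronize(streamB);

    bool ok = all_ones(d_data, n);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(d_data);
    return ok;
}
```

In a real multi-threaded program the cudaStreamSynchronize(streamA) call would have to complete in thread-A before it signals thread-B.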

I think either way, cudaMemsetAsync() should be used instead. Am I correct?

Thanks,
Cui

Robert_Crovella · January 5, 2016, 6:37pm

First of all, you shouldn’t really ever expect synchronous behavior between CUDA operations issued to different streams. That is not good CUDA programming practice, IMO. The whole point of separate streams is to de-synchronize activity.

As for the asynchronous behavior of cudaMemset() causing a problem in your scenario: I’m not sure that is the case.

Your non-async cudaMemset operation is issued to the default stream. Issuing an operation to the default stream has a device-wide synchronizing effect:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization

“any CUDA command to the NULL stream,”

assuming you have not modified the behavior of the default stream.
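For reference, two common ways the default-stream behavior gets modified are compiling with `nvcc --default-stream per-thread` (or defining `CUDA_API_PER_THREAD_DEFAULT_STREAM` before including the CUDA headers), and creating streams with the `cudaStreamNonBlocking` flag, which do not implicitly synchronize with the legacy default stream. A minimal sketch of the second, using a hypothetical function name, just to show how the flag is set and queried:

```cuda
#include <cuda_runtime.h>

// Returns true if a stream created with cudaStreamCreate reports the
// default flags and one created with cudaStreamNonBlocking reports that flag.
bool stream_flags_as_expected() {
    cudaStream_t blocking, non_blocking;
    if (cudaStreamCreate(&blocking) != cudaSuccess) return false;
    if (cudaStreamCreateWithFlags(&non_blocking, cudaStreamNonBlocking)
            != cudaSuccess) {
        cudaStreamDestroy(blocking);
        return false;
    }

    unsigned int f_blocking = 0, f_non_blocking = 0;
    cudaError_t e1 = cudaStreamGetFlags(blocking, &f_blocking);
    cudaError_t e2 = cudaStreamGetFlags(non_blocking, &f_non_blocking);

    cudaStreamDestroy(blocking);
    cudaStreamDestroy(non_blocking);

    return e1 == cudaSuccess && e2 == cudaSuccess &&
           f_blocking == cudaStreamDefault &&
           f_non_blocking == cudaStreamNonBlocking;
}
```

Work issued to `non_blocking` would not be held up by a cudaMemset() in the legacy default stream, so the device-wide synchronizing effect described above should not be assumed for such streams.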

But considering all these gyrations is not the right way to go in my opinion. It’s not a sensible way to use the available APIs.

For a multi-streamed application, I would always use the async APIs. Trying to figure out the impact of the non-async API call on the behavior of a streamed application is not worth it.
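One possible all-async pattern for the two-stream scenario is to express the dependency with an event instead of blocking the host. This is an illustrative assumption, not something spelled out in the posts above; `add_one` and `memset_then_kernel_with_event` are hypothetical names, and a single host thread is used for brevity:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if the kernel in stream B observed the memset from stream A.
bool memset_then_kernel_with_event(int n) {
    int *d_data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    cudaStream_t streamA, streamB;
    cudaEvent_t memset_done;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEventCreateWithFlags(&memset_done, cudaEventDisableTiming);

    // Stream A: async memset, then mark its completion with an event.
    cudaMemsetAsync(d_data, 0, bytes, streamA);
    cudaEventRecord(memset_done, streamA);

    // Stream B: wait on the event on the device side, then run the kernel.
    // The host thread is not blocked by either call.
    cudaStreamWaitEvent(streamB, memset_done, 0);
    add_one<<<(n + 255) / 256, 256, 0, streamB>>>(d_data, n);

    // Block only when the host actually needs the result.
    cudaStreamSynchronize(streamB);
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), d_data, bytes,
                                 cudaMemcpyDeviceToHost);

    cudaEventDestroy(memset_done);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(d_data);
    if (err != cudaSuccess) return false;

    for (int v : host) {
        if (v != 1) return false;
    }
    return true;
}
```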

Cui · January 5, 2016, 6:51pm

Based on my understanding, cudaMemset() would synchronize all streams at the time when it’s issued, but the control could be returned to host code before it’s finished. In that sense, it’s both “synchronous” (will synchronize all streams when issued) and “asynchronous” (will return before finished). What a weird design!

Totally agree that a multi-stream application should use the async APIs. I have this cudaMemset() call because I was modifying my program from a single-stream one to a multi-stream one, and somehow I forgot to change this function. I hope changing it to cudaMemsetAsync() could fix my bug in the other thread.

Thank you!

Cui

njuffa · January 5, 2016, 7:27pm

The other aspect you may want to examine is whether a memset() operation is actually required. In my experience there is rarely a need for such bulk initialization, whether on the host or the device.
