Just out of curiosity: the driver API spec says about memset, "The synchronous memset functions are asynchronous with respect to the host except when the target is pinned host memory or a Unified Memory region, in which case they are fully synchronous. The Async versions are always asynchronous with respect to the host." My question is, why is the plain memset asynchronous when there is already an Async version of the API? Is it a historical issue?
Because memset here is (ordinarily) affecting device memory, which is not (ordinarily) directly accessible by host code, there is no particular reason to make the call synchronous with respect to the host thread. If the host thread wanted to observe the result, it would have to (ordinarily) use something like cudaMemcpy to transfer the data from device to host, at which point consistency of data could/would be ensured.
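As a rough illustration of that pattern, here is a minimal sketch using the runtime API (cudaMemset rather than the driver API's cuMemsetD8, on the assumption that the same synchronization rules apply). The function name memset_then_copy_back and the fill value 0xAB are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills a device buffer with 0xAB, copies it back,
// and returns the number of bytes that differ from 0xAB (-1 on a CUDA error).
int memset_then_copy_back(size_t n) {
    unsigned char *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, n) != cudaSuccess) return -1;

    // Target is ordinary device memory, so this call may return to the
    // host before the device has finished writing.
    if (cudaMemset(d_buf, 0xAB, n) != cudaSuccess) {
        cudaFree(d_buf);
        return -1;
    }

    // The host cannot read d_buf directly. This blocking copy is issued to
    // the same (default) stream, so it is ordered after the memset.
    std::vector<unsigned char> h_buf(n, 0);
    cudaError_t err = cudaMemcpy(h_buf.data(), d_buf, n, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (unsigned char b : h_buf) {
        if (b != 0xAB) ++mismatches;
    }
    return mismatches;
}
```

Calling memset_then_copy_back(1 << 20) from main should report zero mismatches, even though nothing synchronizes explicitly between the memset and the copy.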
Therefore, in the general case there is no hazard in making it asynchronous with respect to the host thread, and doing so is generally better for asynchronous work issuance strategies. Note that the explicit Async versions also accept a stream, and can therefore be used in a stream-aware way, which affects device behavior, so there is still a logical distinction here.
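To make the stream-aware distinction concrete, here is a small sketch, again with the runtime API. The function name async_memset_in_stream and the fill value 0x5A are hypothetical; the point is only that cudaMemsetAsync takes a stream argument and is ordered against other work in that stream.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: issues a memset and a copy into a user-created stream,
// then returns the number of bytes that differ from 0x5A (-1 on a CUDA error).
int async_memset_in_stream(size_t n) {
    cudaStream_t stream = nullptr;
    unsigned char *d_buf = nullptr;
    unsigned char *h_buf = nullptr;  // pinned, so the async copy can be truly asynchronous

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_buf, n);
    if (err == cudaSuccess) err = cudaMallocHost((void **)&h_buf, n);

    // Both calls only enqueue work in `stream` and return to the host.
    if (err == cudaSuccess) err = cudaMemsetAsync(d_buf, 0x5A, n, stream);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(h_buf, d_buf, n, cudaMemcpyDeviceToHost, stream);

    // The host must wait for the stream before reading h_buf.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    int mismatches = -1;
    if (err == cudaSuccess) {
        mismatches = 0;
        for (size_t i = 0; i < n; ++i) {
            if (h_buf[i] != 0x5A) ++mismatches;
        }
    }

    // Release whatever was successfully created.
    if (h_buf) cudaFreeHost(h_buf);
    if (d_buf) cudaFree(d_buf);
    if (stream) cudaStreamDestroy(stream);
    return mismatches;
}
```

Work issued to other streams is free to overlap with this memset, which is not something the non-Async call lets you express.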
In the case of a pinned memory or UM target, the results of the memset operation are immediately visible to host code, so it makes sense that these operations are made synchronous with respect to the host thread, to ensure consistency of data.
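A sketch of the pinned-memory case, relying on the behavior described in the documentation quoted above. It assumes a platform with unified virtual addressing, so that a pointer from cudaMallocHost can be passed directly to cudaMemset; the function name memset_pinned_then_read and the fill value 0x3C are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: memsets a pinned host buffer and reads it from the
// host right away. Returns the number of bytes that differ from 0x3C
// (-1 on a CUDA error).
int memset_pinned_then_read(size_t n) {
    unsigned char *h_pinned = nullptr;
    if (cudaMallocHost((void **)&h_pinned, n) != cudaSuccess) return -1;

    // Target is pinned host memory, so per the quoted documentation this
    // call is fully synchronous: it returns only after the fill is done.
    if (cudaMemset(h_pinned, 0x3C, n) != cudaSuccess) {
        cudaFreeHost(h_pinned);
        return -1;
    }

    // No cudaMemcpy and no explicit synchronization before the host read.
    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (h_pinned[i] != 0x3C) ++mismatches;
    }

    cudaFreeHost(h_pinned);
    return mismatches;
}
```

If the call were asynchronous here, the host loop could race with the fill, which is the hazard the synchronous behavior avoids.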
That makes sense, thanks for your patience!