No, it means the call may be asynchronous with respect to the host; that is, it may behave like a kernel launch. Since cudaMemset (normally **) has no bearing on any host data, this should not matter.
All CUDA calls issued to a particular stream will be executed in order, with respect to other CUDA activity issued to the same stream.
Therefore, if you issue a cudaMemset followed by a kernel launch, both in the same stream (or both to the default stream), you can be assured that all of the results of the cudaMemset operation will be visible to any subsequent kernel activity.
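The ordering guarantee above can be sketched as follows. This is a minimal, hypothetical example (the kernel name `checkZeroed` and the buffer sizes are my own choices, not from the original discussion): a memset and a kernel are issued to the same stream, and the kernel counts any element the memset might have missed, with no explicit synchronization between the two operations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: counts nonzero elements in a buffer that was
// just zeroed by a memset issued earlier in the same stream.
__global__ void checkZeroed(const int *data, int n, int *errors)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] != 0)
        atomicAdd(errors, 1);
}

int main()
{
    const int n = 1 << 20;
    int *d_data, *d_errors;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_errors, sizeof(int));
    cudaMemset(d_errors, 0, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Both operations go to stream s: stream ordering guarantees the
    // memset completes before the kernel begins, so no cudaStreamSynchronize
    // or cudaDeviceSynchronize is needed between them.
    cudaMemsetAsync(d_data, 0, n * sizeof(int), s);
    checkZeroed<<<(n + 255) / 256, 256, 0, s>>>(d_data, n, d_errors);

    int errors = -1;
    // cudaMemcpy to the host synchronizes with prior work in the stream.
    cudaMemcpy(&errors, d_errors, sizeof(int), cudaMemcpyDeviceToHost);
    printf("nonzero elements seen by kernel: %d\n", errors);  // expect 0

    cudaStreamDestroy(s);
    cudaFree(d_data);
    cudaFree(d_errors);
    return 0;
}
```

The same guarantee holds if both operations are issued to the default stream with cudaMemset and an ordinary kernel launch.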
** If the target of the cudaMemset operation is either pinned host memory or a unified memory region, then the data is visible to host code. In that situation the stated asynchronous behavior does not apply with respect to the host, in order to preserve sensible program semantics. In those cases the cudaMemset operation should not return until the memset is complete, because the affected data is host-visible. Subsequent host code can therefore "see" the effect of the cudaMemset operation, as it is not asynchronous in that case. It is in effect blocking, with respect to the host thread.
No cudaDeviceSynchronize() should be needed in any of the above cases with respect to the cudaMemset operation. (Use of unified memory may require a cudaDeviceSynchronize() after kernel execution, so that unified data is again "visible" to the host, but that aside has no bearing on the stated behavior of the cudaMemset operation.)
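The pinned-host-memory case can be illustrated with a short sketch (my own example, not from the original thread): the target of the cudaMemset is memory allocated with cudaHostAlloc, so the call blocks until the fill completes and the host can read the result immediately, with no cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    int *h_pinned;
    // Pinned (page-locked) host allocation; with UVA this pointer is
    // also usable by the CUDA runtime as a memset target.
    cudaHostAlloc(&h_pinned, n * sizeof(int), cudaHostAllocDefault);
    for (int i = 0; i < n; i++)
        h_pinned[i] = -1;

    // The target is host-visible, so cudaMemset should not return until
    // the operation is complete: it is blocking with respect to the host.
    cudaMemset(h_pinned, 0, n * sizeof(int));

    // No cudaDeviceSynchronize() needed before reading the data here.
    printf("h_pinned[0] = %d\n", h_pinned[0]);  // expect 0

    cudaFreeHost(h_pinned);
    return 0;
}
```

If the allocation were instead made with cudaMallocManaged, the same host-visibility reasoning would apply to the cudaMemset itself, though as noted above, a cudaDeviceSynchronize() may still be required after any subsequent kernel that touches the managed data.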