cudaMemset with streams: expected a version of cudaMemset for streams

Hello,

Seems to me that the function cudaMemset should have an equivalent function for streams, something like cudaMemsetAsync, but I don’t see any in the documentation. How come? What do I use instead? Does anyone know?

thanks in advance,
NW

I haven’t tested it, but I presume that cudaMemset is already async. It is not as if the call actually has to copy that much data up to the GPU.

Thanks for the post.

I can’t see why it would need to copy any data. But if the memset is preceded by async memcpy calls and kernel launches, I don’t want the memset call to wait on those tasks before it returns control to the host. That seems counterproductive.
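For anyone reading this with a newer toolkit: as far as I know, later CUDA runtime releases do provide cudaMemsetAsync, which takes a stream argument and is ordered after earlier work in that stream. A minimal sketch, assuming your toolkit has it (clearOnStream is a hypothetical helper name, not part of CUDA):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: queue a byte-wise clear of a device buffer on `stream`.
// The memset is ordered after work already queued in the same stream.
cudaError_t clearOnStream(void *devPtr, size_t bytes, cudaStream_t stream)
{
    // cudaMemsetAsync sets each byte to the given value (0 here).
    return cudaMemsetAsync(devPtr, 0, bytes, stream);
}
```

Call cudaStreamSynchronize(stream) (or use an event) before the host relies on the buffer contents.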

Another alternative is to “roll your own” - memset doesn’t have to be anything more complex than something like this:

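template <typename Dtype>
__global__ void deviceMemset(Dtype *mem, const Dtype val, size_t n)
{
	// Grid-stride loop: each thread starts at its global index
	// and advances by the total number of threads in the grid.
	size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
	size_t stride = (size_t)gridDim.x * blockDim.x;

	// size_t indices match the type of n, so large buffers don't overflow an int.
	for (size_t i = tidx; i < n; i += stride) { mem[i] = val; }
}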

You can launch the kernel asynchronously into whatever stream you want. This has the added advantage that you can set word-sized values rather than just bytes.
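To make the stream part concrete, here is a rough sketch of launching that kernel into a stream. The wrapper name fillOnStream, the block size of 256 and the cap of 1024 blocks are my own assumptions for illustration, not tuned values:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

template <typename Dtype>
__global__ void deviceMemset(Dtype *mem, const Dtype val, size_t n)
{
    size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = tidx; i < n; i += stride) { mem[i] = val; }
}

// Hypothetical wrapper: enqueue the fill on `stream` and return without waiting.
template <typename Dtype>
cudaError_t fillOnStream(Dtype *mem, Dtype val, size_t n, cudaStream_t stream)
{
    if (n == 0) return cudaSuccess;               // nothing to launch
    const unsigned int threads = 256;             // assumed block size
    size_t blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    if (blocks > 1024) blocks = 1024;             // assumed cap; the stride loop covers the rest
    // The fourth launch parameter selects the stream.
    deviceMemset<Dtype><<<(unsigned int)blocks, threads, 0, stream>>>(mem, val, n);
    return cudaGetLastError();                    // reports launch errors only
}
```

Something like fillOnStream(d_buf, 1.0f, n, stream) then queues the fill behind whatever is already in that stream; synchronize the stream before reading the buffer on the host.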

Excellent idea. I didn’t think to just make it another kernel call. Thanks!
