cudaMemset with streams: expected a version of cudaMemset for streams

NeedWisdom · September 15, 2010, 7:23pm

Hello,

Seems to me that the function cudaMemset should have an equivalent function for streams, something like cudaMemsetAsync, but I don’t see any in the documentation. How come? What do I use instead? Does anyone know?

thanks in advance,
NW

MisterAnderson42 · September 16, 2010, 1:27am

I haven’t tested it, but I presume that cudaMemset is already async. It is not as if the call actually has to copy that much data up to the GPU.

NeedWisdom · September 16, 2010, 12:35pm

Thanks for the post.

I can’t see why it would need to copy any data. But if the memset is preceded by async memcpy calls and kernel launches, I don’t want the memset call to wait on those tasks before returning control to the host. That seems counterproductive.
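
For readers arriving later: CUDA runtime releases newer than the one discussed in this thread do expose `cudaMemsetAsync`, which takes a stream argument. The following is a minimal sketch of the pattern described above (async copy, kernel, then memset, all queued in one stream), not a definitive implementation. The `scale` kernel, the buffer names and the sizes are hypothetical stand-ins, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for "work that precedes the memset".
__global__ void scale(float *data, float factor, size_t n)
{
    size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const size_t n = 1 << 20;                  // arbitrary element count for this sketch
    const size_t bytes = n * sizeof(float);
    float *hIn = nullptr, *dIn = nullptr, *dScratch = nullptr;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost(&hIn, bytes);               // pinned host memory, so the copy can overlap host work
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dScratch, bytes);
    for (size_t i = 0; i < n; ++i) hIn[i] = 1.0f;

    // All three operations are queued in the same stream and run in order on the device.
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(dIn, 2.0f, n);
    cudaMemsetAsync(dScratch, 0, bytes, stream);  // byte-wise fill, queued behind the kernel

    // The host can do unrelated work here, then wait once for the whole stream.
    cudaStreamSynchronize(stream);
    printf("stream finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(dScratch);
    cudaFree(dIn);
    cudaFreeHost(hIn);
    cudaStreamDestroy(stream);
    return 0;
}
```

Whether the memset call itself returns before the earlier work completes can depend on the runtime version and on the kind of memory involved, so it is worth timing on the target setup rather than assuming.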

avidday · September 16, 2010, 2:44pm

Another alternative is to “roll your own” - memset doesn’t have to be anything more complex than something like this:

template <typename Dtype>
__global__ void deviceMemset(Dtype *mem, const Dtype val, size_t n)
{
	// Grid-stride loop: start at the global thread index and step by the total thread count.
	size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
	size_t stride = (size_t)gridDim.x * blockDim.x;

	for (size_t i = tidx; i < n; i += stride) { mem[i] = val; }
}

You can launch the kernel asynchronously into whatever stream you want. This has the added advantage that you can set word-sized values rather than just bytes.
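
As a rough illustration of launching such a kernel into a stream, here is a self-contained sketch. The launch configuration (128 blocks of 256 threads), the fill value 42 and the buffer names are assumptions chosen for the example, not anything prescribed in the post, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride fill kernel, same idea as deviceMemset in the post above.
template <typename Dtype>
__global__ void deviceMemset(Dtype *mem, const Dtype val, size_t n)
{
    size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = tidx; i < n; i += stride) { mem[i] = val; }
}

int main()
{
    const size_t n = 1 << 20;     // element count (arbitrary for this sketch)
    const int blockSize = 256;    // assumed launch configuration
    const int gridSize = 128;     // fewer threads than elements is fine: the loop strides

    int *dBuf = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&dBuf, n * sizeof(int));

    // Queue the fill in `stream`; the launch returns to the host without waiting.
    // Each int becomes 42, which a byte-wise memset could not produce.
    deviceMemset<int><<<gridSize, blockSize, 0, stream>>>(dBuf, 42, n);

    // Copy back in the same stream so the copy is ordered after the fill.
    std::vector<int> hBuf(n);
    cudaMemcpyAsync(hBuf.data(), dBuf, n * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("first=%d last=%d\n", hBuf[0], hBuf[n - 1]);

    cudaFree(dBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```

Because the kernel is a template, the same code fills `float`, `double` or small struct buffers by changing the type argument and the value passed in.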

NeedWisdom · September 16, 2010, 2:45pm

Excellent idea. I didn’t think to just make it another kernel call. Thanks!
