cudaMemset()

Is the cudaMemset() function really efficient? Is there another method I can use to initialize a data structure in global memory?
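For context, a minimal cudaMemset() call looks like the sketch below (names and sizes are made up for illustration). One caveat worth noting: cudaMemset fills individual bytes, so only values whose four bytes are identical, such as 0 or 0xFF, can be written into an int array this way.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int N = 1024;
    int *d_mem = NULL;

    // Allocate N ints in global memory.
    cudaMalloc((void **)&d_mem, N * sizeof(int));

    // cudaMemset writes the byte value 0 into every byte of the
    // allocation, so each int ends up as 0.
    cudaMemset(d_mem, 0, N * sizeof(int));

    // Copy one element back to check.
    int h = -1;
    cudaMemcpy(&h, d_mem, sizeof(int), cudaMemcpyDeviceToHost);
    printf("first element = %d\n", h);

    cudaFree(d_mem);
    return 0;
}
```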

__global__ void fill(int *MEM, int N, int DESIRED_VALUE)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
    {
        MEM[i] = DESIRED_VALUE;
    }
}

All accesses are coalesced, so what more performance could this possibly give? We could always treat MEM as an int * pointer, set N to ORIGINAL_N/4, and pack DESIRED_VALUE into a 32-bit word of four repeated bytes. It would also make sense to unroll the loop accordingly.
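The packed-byte idea above can be sketched roughly like this (an illustrative sketch, not the poster's actual code; the names fill_packed and fill_bytes, and the launch configuration, are assumptions):

```cuda
#include <cuda_runtime.h>

// Fill a byte buffer by writing one 32-bit word per element.
// The buffer is viewed as int*, so the length shrinks to ORIGINAL_N / 4
// (assuming ORIGINAL_N is a multiple of 4).
__global__ void fill_packed(int *MEM, int N, int packed)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += blockDim.x * gridDim.x)
    {
        MEM[i] = packed;
    }
}

void fill_bytes(unsigned char *d_buf, int ORIGINAL_N, unsigned char value)
{
    // Replicate the byte into all four byte lanes of a 32-bit word.
    int packed = (int)(0x01010101u * value);
    fill_packed<<<64, 256>>>((int *)d_buf, ORIGINAL_N / 4, packed);
}
```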

I would hope cudaMemset is written the same way. So, I really wonder if you would benefit from anything else… unless you are using some non-CUDA features of the graphics card to do some magic…

Thanks.

It seems like the cudaMemset method is faster. I ran some tests to see whether cudaMemcpy or copying by kernel invocation is faster… well… cudaMemcpy is.

So I suppose cudaMemset should be faster as well; there is no kernel launch that has to be issued. By the way, I tested copying with a total vector length of 256,000. The results may differ for smaller sizes…
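A comparison like the one described can be timed with CUDA events. This is a minimal sketch: the size of 256,000 elements comes from the post above, while the kernel, its launch configuration, and the element type are assumptions.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hand-written device-to-device copy, grid-stride style.
__global__ void copy_kernel(int *dst, const int *src, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        dst[i] = src[i];
}

int main(void)
{
    const int N = 256000;
    int *d_src, *d_dst;
    cudaMalloc((void **)&d_src, N * sizeof(int));
    cudaMalloc((void **)&d_dst, N * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Time the runtime's device-to-device copy.
    cudaEventRecord(start);
    cudaMemcpy(d_dst, d_src, N * sizeof(int), cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpy:  %f ms\n", ms);

    // Time an equivalent hand-written copy kernel.
    cudaEventRecord(start);
    copy_kernel<<<64, 256>>>(d_dst, d_src, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy kernel: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

For a fair comparison you would normally warm up both paths first and average over many repetitions, since a single launch is dominated by one-time overheads.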

I would have thought that cudaMemset would be better than explicitly doing it.

It may be possible to use neither. For example, if you are doing many ‘+=’ operations on a buffer, simply make the first one ‘=’. This requires more explicit coding, but it should be better than using memset.
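The trick above can be sketched like this (illustrative only; the accumulate kernel and all of its names are assumptions, not code from this thread):

```cuda
#include <cuda_runtime.h>

// Accumulate partial results into out[] without zeroing it first:
// the first pass overwrites ('='), later passes add ('+='), so no
// separate cudaMemset or fill kernel is needed.
__global__ void accumulate(float *out, const float *in, int n, bool first)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
    {
        if (first)
            out[i] = in[i];   // replaces the zero-initialization
        else
            out[i] += in[i];
    }
}
```

You would launch this once per partial input, passing first = true only on the first launch; since the flag is uniform across the whole grid, the branch causes no divergence.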

If I need to initialize a vector to zero, is memset faster?

Than what? Explicitly writing a kernel to do it? Probably.

Initialising it in the kernel that uses it may or may not be better, depending on how you use it…