Is the cudaMemset() function really efficient? Can I use another method to initialize a data structure in global memory?

for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
    MEM[i] = DESIRED_VALUE;  // each thread writes a strided subset of the array

All accesses are coalesced, so what more performance could anything else possibly give? We can always treat “MEM” as an “int *” pointer, set N to ORIGINAL_N/4 (assuming the size is a multiple of 4), and make DESIRED_VALUE a 32-bit integer with the byte value packed into all four bytes. It would make sense to unroll the loop accordingly.
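A minimal sketch of the packed-word idea, assuming a buffer whose size in bytes is a multiple of 4. The names fillWords, n_bytes, and value, and the launch configuration of 128 blocks of 256 threads, are hypothetical choices for illustration and are not taken from cudaMemset's actual implementation.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride fill: every thread stores whole 32-bit words, so neighbouring
// threads write neighbouring addresses and the stores coalesce.
__global__ void fillWords(uint32_t *mem, size_t n_words, uint32_t packed)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n_words; i += stride)
        mem[i] = packed;
}

int main()
{
    const size_t n_bytes = 1 << 20;               // assumed size, a multiple of 4
    const unsigned char value = 0xAB;             // hypothetical byte value to store
    const uint32_t packed = 0x01010101u * value;  // replicate the byte into all four bytes

    unsigned char *d_mem = nullptr;
    if (cudaMalloc((void **)&d_mem, n_bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // ORIGINAL_N / 4 words instead of ORIGINAL_N bytes.
    fillWords<<<128, 256>>>(reinterpret_cast<uint32_t *>(d_mem), n_bytes / 4, packed);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    // Copy back and check that every byte holds the requested value.
    std::vector<unsigned char> h_mem(n_bytes);
    cudaMemcpy(h_mem.data(), d_mem, n_bytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (size_t i = 0; i < n_bytes; ++i)
        if (h_mem[i] != value) { ok = false; break; }
    printf("fill %s\n", ok ? "matches" : "does not match");

    cudaFree(d_mem);
    return ok ? 0 : 1;
}
```

Wider stores (for example uint4) or an unrolled loop body could be tried in the same way; whether they help depends on the device and would need measuring.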

I would expect cudaMemset to be written the same way. So I really wonder whether you would benefit from anything else… unless you are using some non-CUDA features of the graphics card to do some magic…


It seems like the cudaMemset method is faster. I ran some tests to see whether cudaMemcpy or copying with a kernel invocation is faster… well… cudaMemcpy is.

So I suppose cudaMemset should be faster as well, since no kernel has to be launched explicitly. By the way… I tested copying with a total vector length of 256000. Maybe the results differ for smaller sizes…
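One way to check this for the fill case rather than the copy case is to time both approaches with CUDA events. The sketch below is an assumption-laden illustration: the float element type, the repetition count reps, the kernel name zeroKernel, and the 64x256 launch configuration are all hypothetical; only the vector length of 256000 comes from the post above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hand-written zero fill with a grid-stride loop.
__global__ void zeroKernel(float *mem, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        mem[i] = 0.0f;
}

int main()
{
    const int n = 256000;  // vector length mentioned in the thread
    const int reps = 100;  // hypothetical repetition count to average out noise
    const size_t bytes = n * sizeof(float);

    float *d = nullptr;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msMemset = 0.0f, msKernel = 0.0f;

    // Warm-up so one-time initialisation costs are not measured.
    cudaMemset(d, 0, bytes);
    zeroKernel<<<64, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Time cudaMemset.
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cudaMemset(d, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msMemset, start, stop);

    // Time the explicit kernel.
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        zeroKernel<<<64, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msKernel, start, stop);

    printf("cudaMemset: %f ms per call\n", msMemset / reps);
    printf("kernel:     %f ms per call\n", msKernel / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Changing n shows whether the ranking differs for smaller or larger buffers on a given device.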

I would have thought that cudaMemset would be better than doing it explicitly.

It may be possible to use neither. For example, if you are doing many ‘+=’ operations, simply make the first one ‘=’. This requires more explicit coding, but should be better than using memset.
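A minimal sketch of that idea, assuming the accumulation is done by repeated kernel launches over the same output vector. The kernel name accumulate, the template flag kFirst, and the test data (three terms with values 1, 2, and 3) are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// kFirst == true overwrites the output, so it never needs to be zeroed first;
// kFirst == false adds to whatever the earlier launches stored.
template <bool kFirst>
__global__ void accumulate(float *out, const float *term, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        if (kFirst)
            out[i] = term[i];
        else
            out[i] += term[i];
    }
}

int main()
{
    const int n = 1024;   // assumed vector length
    const int terms = 3;  // assumed number of terms to sum

    // Term k holds the constant k + 1, so each output element should end up as 1 + 2 + 3.
    std::vector<float> h_terms(terms * n);
    for (int k = 0; k < terms; ++k)
        for (int i = 0; i < n; ++i)
            h_terms[k * n + i] = float(k + 1);

    float *d_terms = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_terms, terms * n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));  // deliberately left uninitialised
    cudaMemcpy(d_terms, h_terms.data(), terms * n * sizeof(float), cudaMemcpyHostToDevice);

    accumulate<true><<<4, 256>>>(d_out, d_terms, n);  // first term uses '='
    for (int k = 1; k < terms; ++k)
        accumulate<false><<<4, 256>>>(d_out, d_terms + k * n, n);  // the rest use '+='
    cudaDeviceSynchronize();

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 6.0f) { ok = false; break; }
    printf("sum %s\n", ok ? "matches" : "does not match");

    cudaFree(d_terms);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

The template flag is resolved at compile time, so the branch costs nothing at run time; the price is that the caller has to know which launch is the first one.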

If I need to initialize a vector to zero, is memset faster?

Than what? Explicitly writing a kernel to do it? Probably.

Initialising it in the kernel that uses it may or may not be better, depending on how you use it…