Fastest way to initialise large arrays: cudaMemset vs cudaMemcpyDeviceToDevice

I think the need to initialise large arrays comes up many times.
Is there any hardware support for this?
(I seem to remember on the DEC VAX computer you could ask for a zero page of memory,
which gave you n*512 bytes all preset to zero.)
Perhaps on a GPU it would be nice to have values other than zero.

This thought was prompted by noting that on
(half of a) GeForce GTX 295, cudaMemset clears about 14.5 billion bytes/second,
whereas the SDK bandwidthTest’s cudaMemcpy (cudaMemcpyDeviceToDevice) claims 93 billion bytes/second.
Why the difference?
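For anyone wanting to reproduce the comparison, here is a minimal timing sketch using CUDA events. The helper names (rate, memset_rate, d2d_copy_rate) are made up for illustration and this is not the SDK bandwidthTest code. Note that a device-to-device copy both reads and writes every byte, so it is worth checking how a given tool counts bytes before comparing its figure with a memset figure.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Bytes per second for `reps` operations on `bytes` bytes that took `ms` milliseconds.
static double rate(size_t bytes, int reps, float ms) {
    return ms > 0.0f ? (double)bytes * reps / (ms * 1e-3) : 0.0;
}

// Hypothetical helper: bytes cleared per second by cudaMemset. Returns 0 on allocation failure.
double memset_rate(size_t bytes, int reps) {
    assert(reps > 0);
    void* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return 0.0;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemset(d, 0, bytes);                 // warm-up, not timed
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) cudaMemset(d, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the timed work has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return rate(bytes, reps, ms);
}

// Hypothetical helper: bytes copied per second by a device-to-device cudaMemcpy.
// Each copied byte is one read plus one write, so memory traffic is twice this figure.
double d2d_copy_rate(size_t bytes, int reps) {
    assert(reps > 0);
    void* src = nullptr;
    void* dst = nullptr;
    if (cudaMalloc(&src, bytes) != cudaSuccess) return 0.0;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return 0.0; }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up, not timed
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return rate(bytes, reps, ms);
}
```

Calling memset_rate(64u << 20, 20) and d2d_copy_rate(64u << 20, 20) on the same card gives two numbers that are at least measured the same way.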

ps: was incorrectly posted to apple forum:-(

One possibility (and this is all speculation, since I don’t have access to NVIDIA’s actual code): cudaMemset operates on bytes, which aren’t a natural size for the GPU, while cudaMemcpy might be written to operate on a ‘natural’ size for the GPU. In this case, cudaMemset could presumably be optimised. Alternatively, it might be as simple as cudaMemset having to launch a kernel, but cudaMemcpy being able to send a single instruction to the GPU’s memory controller.

The fastest way is likely to write your own kernel which walks over the data and sets it to a given value; the memset and memcpy functions are not the fastest possible…
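A minimal sketch of such a kernel, assuming a float array and a grid-stride loop. The names fill_kernel and fill_device, the block size of 256 and the cap of 1024 blocks are all illustrative guesses, not tuned values, and whether this beats cudaMemset on a given card needs measuring.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Grid-stride loop: thread t writes elements t, t + stride, t + 2*stride, ...
// so neighbouring threads write neighbouring addresses (coalesced stores).
__global__ void fill_kernel(float* data, size_t n, float value) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = value;
}

// Hypothetical host wrapper. The launch configuration is an assumption, not a tuned choice.
cudaError_t fill_device(float* d_data, size_t n, float value) {
    if (n == 0) return cudaSuccess;
    const int threads = 256;
    size_t blocks_needed = (n + threads - 1) / threads;
    int blocks = (int)(blocks_needed < 1024 ? blocks_needed : 1024);
    fill_kernel<<<blocks, threads>>>(d_data, n, value);
    return cudaGetLastError();   // reports launch errors only; the kernel runs asynchronously
}
```

Unlike cudaMemset, this takes a full float value, so it covers the "values other than zero" case from the first post.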

Alex Dubinsky found that small kernels were actually the fastest way to zero or copy memory.
This was a pre-Fermi experiment, but it would be easy to test again with a Fermi board.

Ah, my memory was correct…here’s the test from over two years ago.

cudaMemset working on bytes??? Hmm… I hope NV re-writes it for us…

Yes, I agree. I assumed nVidia would make a better job of coding basic functionality than I would. They are the experts and only have to do it once, whilst there are n of us, so we have to do it n times (and debug it n times :-( ).

I wonder if anyone from nVidia can comment on the original idea:


Please note that I was only speculating. This was based on the fact that cudaMemset, like memset itself, takes a byte argument; it doesn’t operate like std::fill. So a naive implementation would launch one thread per byte, rather than the optimal one thread per word (i.e. internally repeat the byte four times). I don’t know if this is the case, but it strikes me as a possibility (I’ve certainly found cudaMemset3D to be less than perfectly reliable).
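To make the "repeat the byte four times" idea concrete, here is a sketch of a word-wise memset. It says nothing about how cudaMemset is really implemented; replicate_byte, memset_words and memset_wordwise are hypothetical names, and the wrapper assumes a 4-byte-aligned pointer (it rejects anything else) and leaves any 1 to 3 leftover bytes to plain cudaMemset.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Repeat one byte four times to build a 32-bit pattern, e.g. 0xAB -> 0xABABABAB.
__host__ __device__ inline unsigned int replicate_byte(unsigned char b) {
    return 0x01010101u * b;
}

// One 32-bit store per loop iteration instead of one store per byte.
__global__ void memset_words(unsigned int* dst, size_t n_words, unsigned int pattern) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n_words; i += stride)
        dst[i] = pattern;
}

// Hypothetical memset-like wrapper; launch configuration is an untuned assumption.
cudaError_t memset_wordwise(void* d_ptr, int value, size_t bytes) {
    if ((reinterpret_cast<uintptr_t>(d_ptr) & 3u) != 0) return cudaErrorInvalidValue;
    size_t n_words = bytes / 4;
    size_t tail = bytes % 4;
    if (n_words > 0) {
        const int threads = 256;
        size_t blocks_needed = (n_words + threads - 1) / threads;
        int blocks = (int)(blocks_needed < 1024 ? blocks_needed : 1024);
        memset_words<<<blocks, threads>>>(static_cast<unsigned int*>(d_ptr), n_words,
                                          replicate_byte((unsigned char)value));
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;
    }
    if (tail > 0)   // leftover bytes that do not fill a whole word
        return cudaMemset(static_cast<unsigned char*>(d_ptr) + n_words * 4, value, tail);
    return cudaSuccess;
}
```

Wider stores (for example uint4) would be the obvious next thing to try, at the cost of a stricter alignment requirement.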

Use thrust::fill for this.
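A short sketch of what that looks like, both for a thrust::device_vector and for a raw pointer that came from cudaMalloc. The wrapper names fill_vector and fill_raw are only for illustration; thrust::fill and thrust::device_pointer_cast are the actual Thrust calls.

```cuda
#include <cassert>
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/count.h>

// Fill a Thrust-managed device array with an arbitrary value (not just a byte).
void fill_vector(thrust::device_vector<int>& v, int value) {
    thrust::fill(v.begin(), v.end(), value);
}

// Fill memory obtained from cudaMalloc: wrap the raw pointer so Thrust
// knows it refers to device memory and dispatches to the GPU.
void fill_raw(float* d_data, size_t n, float value) {
    thrust::device_ptr<float> p = thrust::device_pointer_cast(d_data);
    thrust::fill(p, p + n, value);
}
```

Because the value is typed, this behaves like std::fill rather than memset, which avoids the byte-argument limitation discussed above.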