Setting float arrays to a value

From my understanding, setting all elements of a float array to an arbitrary value can’t be done with cudaMemset, since it only writes the same byte value across the buffer. So the question becomes: what is the best way to set all of these elements to one number? Say I need an array to be reset every time I iterate through a loop, and it needs to be set to something like 1 or 0 or whatever number. Does it make sense to write and launch a simple kernel just for this? Or, if the value is the same every time (say 1.0f), would it be faster to keep a pre-filled array in GPU global memory and just cudaMemcpy it device-to-device, since the transfer is very fast there?

Example:

// call only once before iteration begins
initVector<<<dimGrid, dimBlock>>>(gpuZero, 0, nColumns);

for (int i = 0; i < nColumns; i++) {
    cudaMemcpy(gpuTemp, gpuZero, nColumns * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(gpuVecW, gpuZero, nColumns * sizeof(float), cudaMemcpyDeviceToDevice);

    // do a bunch of computations
}

vs.

for (int i = 0; i < nColumns; i++) {
    initVector<<<dimGrid, dimBlock>>>(gpuTemp, 0, nColumns);
    initVector<<<dimGrid, dimBlock>>>(gpuVecW, 0, nColumns);

    // do a bunch of computations
}

I would think the memcpy would be faster, but I don’t know. Also, are there other ways that I’m not thinking about?

You can use cudaMemcpy() for the device-to-device copy if you have enough spare GPU memory to allocate a third buffer holding the zeroes. You can also use cudaMemset() if you only need to initialize to 0.0f (an all-zero byte pattern happens to be 0.0f in IEEE floating point):

cudaMemset((void*) d_buffer, 0, numFloats*sizeof(float));

but it’s really slow. A small GPU kernel will do the initialization much faster. I wrote a small test program and measured the equivalent memory bandwidth of cudaMemset() and of a GPU kernel that fills a float array with a constant value. On my GTX 280 using CUDA 1.1:

buffer size    num       cudaMemset    GPU kernel
   (MB)       floats       (GB/s)        (GB/s)
    0.1         25k          4.0           6.5
    0.3         75k          6.1          17.2
    1.0        250k          7.5          37.1
    3.0        750k          8.0          55.9
   10.0        2.5M          8.2          67.1
   30.0        7.5M          8.3          69.2
  100.0         25M          8.3          65.1

For large buffers, the GPU initialization kernel achieves about 60% of the device-to-device bandwidth reported by bandwidthTest (116 GB/s).
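For reference, such a fill kernel might look like the sketch below. This is a minimal sketch, not the exact code used for the measurements above; the names (fillKernel, fillFloatArray, blockSize) and the 256-thread block size are my own choices.

__global__ void fillKernel(float *buf, float value, int n)
{
    // One thread per element; guard against threads past the end.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buf[idx] = value;
}

// Host-side wrapper: round the grid size up so every element is covered.
void fillFloatArray(float *d_buf, float value, int n)
{
    int blockSize = 256;
    int gridSize  = (n + blockSize - 1) / blockSize;
    fillKernel<<<gridSize, blockSize>>>(d_buf, value, n);
}

Because the kernel takes the value as a parameter, the same code works for 0.0f, 1.0f, or any other constant.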

Awesome. I ended up using a kernel when I had to, and I stored a vector with all elements set to 0 since I needed it frequently and had space for it; I then just cudaMemcpy’d the zero vector to wherever I needed it. My kernel was basically vector[Idx] = value, where Idx is the usual unique index computed from the thread and block IDs and value is whatever I wanted the array initialized to. I didn’t realize cudaMemset was that slow; I avoided it mostly because I occasionally had to initialize to values other than 0, so I figured I might as well use the same kernel for my zero vectors too.