cudaMemset question

I'm allocating a float array of 84,044,410 elements using cudaMalloc, so a total of 336,177,640 bytes (~320 MB).
I then reset its contents using cudaMemset:
CUDA_SAFE_CALL( cudaMemset( pDeviceOutput1, 0, iRawDataSize * sizeof( float ) ) ); //iRawDataSize == 84,044,410
This takes ~40 ms. Is that plausible? Reasonable? Is there a better/faster way to do this?

Thanks in advance.

Let’s see:
(336177640 bytes / 40e-3 seconds) / (1024^3 bytes/GiB) = 7.82724563 GiB/s

What hardware are you running on? That is roughly 1/10th of the bandwidth available on an 8800 GTX.

That said, I do recall another user on the forums running into a similar performance problem with cudaMemset before. You could simply write a kernel that writes 0’s to all the floats in a fully coalesced manner, which should get you close to the full bandwidth of the device.
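
Something along these lines is what I have in mind. It is only a rough sketch: the kernel name zeroFloats and the launch configuration (256 threads per block, 1024 blocks) are my own assumptions, not tuned values, and the timing code is just there so you can compare against your 40 ms. Consecutive threads write consecutive floats, which is what makes the stores coalesce.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (name is an assumption). Thread i writes element i,
// then jumps ahead by the total thread count, so neighbouring threads
// always touch neighbouring floats and the stores coalesce.
__global__ void zeroFloats(float *data, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 0.0f;
}

int main()
{
    const size_t n = 84044410;              // element count from the question
    const size_t bytes = n * sizeof(float);

    float *d = NULL;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;                // assumed block size, not tuned
    const int blocks  = 1024;               // assumed grid size; the loop covers the rest

    cudaEventRecord(start, 0);
    zeroFloats<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // wait for the kernel to finish

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms, %.2f GiB/s\n", ms,
           bytes / (ms * 1e-3) / (1024.0 * 1024.0 * 1024.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

The first launch includes some one-time setup cost, so run the kernel once before timing it if you want a fair number. Zeroing this way works for floats because 0.0f is all zero bits, the same pattern cudaMemset writes.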


Thanks for the response. I'm using a GeForce GTX 280.

I’ll try using the kernel.