Hi,
I'm allocating a float array of 84,044,410 elements using cudaMalloc, so a total of 336,177,640 bytes (~320 MB).
I then reset its content using cudaMemset:
CUDA_SAFE_CALL( cudaMemset( pDeviceOutput1, 0, iRawDataSize * sizeof( float ) ) ); //iRawDataSize == 84,044,410
This takes ~40 ms. Is that plausible? Reasonable? Is there a better or faster way to do this?
What hardware are you running on? 336 MB in 40 ms works out to about 8.4 GB/s, which is roughly 1/10th of the bandwidth available on an 8800 GTX.
That said, I do recall another user on the forums running into a similar performance problem with cudaMemset. You could simply write a kernel that writes zeros to all the floats in a fully coalesced manner to get the full bandwidth of the device.
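Here is a minimal sketch of what such a kernel might look like. The names zeroFillKernel, zeroFill and zeroFillWorks are made up for illustration, and the block size of 256 and the cap of 4096 blocks are arbitrary starting points rather than tuned values. It also assumes a reasonably recent toolkit (cudaDeviceSynchronize, nullptr). Adjacent threads write adjacent floats, which is what lets the writes coalesce.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread k of the grid writes elements k, k + stride, ...
// so neighbouring threads always touch neighbouring floats (coalesced writes).
__global__ void zeroFillKernel(float *data, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = 0.0f;
}

// Hypothetical host wrapper: launches the kernel and waits for it to finish.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t zeroFill(float *dData, size_t n)
{
    if (n == 0) return cudaSuccess;
    const int threads = 256;                              // assumed, not tuned
    size_t needed = (n + threads - 1) / threads;
    int blocks = (int)(needed < 4096 ? needed : 4096);    // assumed cap, not tuned
    zeroFillKernel<<<blocks, threads>>>(dData, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}

// Hypothetical self-check: fill a buffer with non-zero bytes, zero it with the
// kernel, copy it back and confirm that every element is 0.0f.
bool zeroFillWorks(size_t n)
{
    if (n == 0) return true;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    bool ok = cudaMemset(d, 0xFF, n * sizeof(float)) == cudaSuccess;
    ok = ok && zeroFill(d, n) == cudaSuccess;
    std::vector<float> h(n, 1.0f);
    ok = ok && cudaMemcpy(h.data(), d, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    for (size_t i = 0; ok && i < n; ++i) ok = (h[i] == 0.0f);
    cudaFree(d);
    return ok;
}
```

In your code this would replace the cudaMemset call with something like zeroFill( pDeviceOutput1, iRawDataSize ). Whether it actually beats cudaMemset on your card is something you would have to time yourself.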