Does pre-allocating a large CUDA buffer hurt performance?

I changed how I’m using CUFFT so that I create only a single buffer on the C1060 card, of
size: cudaMalloc(buffer, 2*sizeof(double)*MAXNUM_FFTS*MAX_BATCHES);

I then never need to malloc and free the buffer again until I release it with cudaFree() at program exit.
I reuse the buffer over and over for all the FFTs I compute since any combination of them will
fit in the maximum sized buffer I made.
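
For readers who want to try the same pattern, here is a minimal sketch of allocate-once, reuse-for-every-FFT. It is not the poster's actual code: MAX_FFT_LEN, MAX_BATCHES, the two job shapes, and the helper fft_buffer_bytes() are hypothetical stand-ins, since the post does not give real values. It assumes in-place double-precision complex (Z2Z) transforms.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Hypothetical limits; the original post does not state the real values. */
#define MAX_FFT_LEN 4096
#define MAX_BATCHES 256

/* Bytes needed for len * batches double-precision complex samples. */
static size_t fft_buffer_bytes(size_t len, size_t batches) {
    return 2 * sizeof(double) * len * batches;
}

int main(void) {
    cufftDoubleComplex *buffer = NULL;

    /* One allocation, sized for the largest job that will ever run. */
    if (cudaMalloc((void **)&buffer,
                   fft_buffer_bytes(MAX_FFT_LEN, MAX_BATCHES)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    /* Two jobs of different shape (assumed values) share the same buffer. */
    const int lens[2]    = {1024, 4096};
    const int batches[2] = {256, 64};

    for (int i = 0; i < 2; ++i) {
        cufftHandle plan;
        if (cufftPlan1d(&plan, lens[i], CUFFT_Z2Z, batches[i]) != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftPlan1d failed for job %d\n", i);
            cudaFree(buffer);
            return EXIT_FAILURE;
        }

        /* Stand-in for copying real input into the front of the buffer. */
        cudaMemset(buffer, 0, fft_buffer_bytes(lens[i], batches[i]));

        /* In-place forward transform on the part of the buffer in use. */
        if (cufftExecZ2Z(plan, buffer, buffer, CUFFT_FORWARD) != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftExecZ2Z failed for job %d\n", i);
        }
        cudaDeviceSynchronize();
        cufftDestroy(plan);
    }

    /* Single free at program exit. */
    cudaFree(buffer);
    return EXIT_SUCCESS;
}
```

Note that this sketch still creates and destroys a plan per job. Plan creation has its own cost, so if the same shapes recur it may be worth caching the plans as well as the buffer.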

But I’m seeing pretty low performance. Is there a negative effect in CUDA when you make a very
large buffer on the device like this? Note that the total size from the product above is about 500 MB,
so this buffer is about 1/8th of the total card memory, which is 4 GB.

My theory is that CUDA can’t do various memory optimizations it would usually do because the buffer
is so large. Is that plausible?


This shouldn’t make any difference. It could be an alignment issue, possibly?
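
One way to test the alignment idea is to time the same transform at two starting addresses inside the big buffer. The sketch below rests on assumptions: the FFT length and batch count are made up, and the helpers is_aligned() and time_fft_ms() are hypothetical names. As far as I know, a pointer returned by cudaMalloc is aligned to at least 256 bytes, so misalignment would more likely come from starting a transform at an offset into the buffer than from the buffer's base address.

```cuda
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Returns 1 if p is a multiple of `alignment` bytes, else 0. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Times one in-place forward Z2Z transform; returns ms, or -1 on failure. */
static float time_fft_ms(cufftHandle plan, cufftDoubleComplex *data) {
    cudaEvent_t start, stop;
    float ms = -1.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cufftResult r = cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    if (r == CUFFT_SUCCESS) {
        cudaEventElapsedTime(&ms, start, stop);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    const int len = 4096, batch = 64;          /* assumed job shape */
    const size_t n = (size_t)len * batch;
    cufftDoubleComplex *buffer = NULL;
    cufftHandle plan;

    /* One spare element so the shifted run stays inside the allocation. */
    if (cudaMalloc((void **)&buffer, (n + 1) * sizeof(cufftDoubleComplex)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(buffer, 0, (n + 1) * sizeof(cufftDoubleComplex));

    if (cufftPlan1d(&plan, len, CUFFT_Z2Z, batch) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        cudaFree(buffer);
        return EXIT_FAILURE;
    }

    time_fft_ms(plan, buffer);                 /* warm-up run, result ignored */

    printf("base 256-byte aligned:    %d\n", is_aligned(buffer, 256));
    printf("shifted 256-byte aligned: %d\n", is_aligned(buffer + 1, 256));
    printf("time at offset 0: %f ms\n", time_fft_ms(plan, buffer));
    printf("time at offset 1: %f ms\n", time_fft_ms(plan, buffer + 1));

    cufftDestroy(plan);
    cudaFree(buffer);
    return EXIT_SUCCESS;
}
```

If the two timings are close, alignment is probably not the cause and the slowdown likely lies elsewhere, for example in transfers or plan creation.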