I put in a change for how I'm using CUFFT so that I only create a single buffer on the C1060 card of size:
cudaMalloc(&buffer, 2 * sizeof(double) * MAXNUM_FFTS * MAX_BATCHES);
I then never need to malloc and free the buffer again; I just release it with cudaFree() at program exit.
I reuse the buffer over and over for all the FFTs I compute since any combination of them will
fit in the maximum sized buffer I made.
But I'm seeing some pretty low performance scores. Is there a negative effect in CUDA when you allocate a very large buffer on the device like this? Note that the total size of the allocation above is about 500 MB, so this buffer takes up about 1/8th of the card's 4 GB of memory.
My theory is that CUDA can't apply various memory optimizations it normally would because the buffer is so large. Is that plausible?