We are trying to run CUFFT in shared memory by copying the input to a shared memory location block by block, but it seems that CUFFT looks in global memory for its input and writes its result to global memory as well.
Can it be performed in shared memory?
Thanks for the reply.
Correct. CUFFT calls one or more kernels internally, and these accept only global memory addresses. Since shared memory has the lifetime of a thread block and one can't perform nested kernel launches, global memory "staging" is the only way to go with CUFFT. You should not be afraid of this, though: G80's global memory has very high bandwidth, and global memory staging occurs in CUFFT itself for most plan configurations. What transform sizes do you need?
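To make the global memory staging concrete, here is a minimal sketch of how a batch of small transforms might be handed to CUFFT from the host. The helper name `fft_batch_via_global` and its parameters `n` and `batch` are made up for illustration and are not part of CUFFT; the sketch also assumes in-place, single-precision complex-to-complex transforms, which may not match your case.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not part of CUFFT): runs `batch` forward C2C FFTs of
// length `n` in place. The data is staged in global memory because the CUFFT
// entry points are host functions that take global memory pointers.
// Returns false if any CUDA or CUFFT call fails.
bool fft_batch_via_global(std::vector<cufftComplex>& host, int n, int batch)
{
    if (n <= 0 || batch <= 0) return false;
    if (host.size() != static_cast<size_t>(n) * static_cast<size_t>(batch)) return false;
    const size_t bytes = host.size() * sizeof(cufftComplex);

    // Allocate the staging buffer in global memory.
    cufftComplex* d_data = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&d_data), bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // One plan covers all `batch` transforms, so CUFFT can process them together.
    cufftHandle plan;
    bool have_plan = false;
    if (ok) {
        have_plan = cufftPlan1d(&plan, n, CUFFT_C2C, batch) == CUFFT_SUCCESS;
        ok = have_plan;
    }

    // Input and output are both global memory pointers (here the same buffer).
    if (ok) ok = cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) == CUFFT_SUCCESS;

    // This blocking copy on the default stream also waits for the FFT to finish.
    if (ok) ok = cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (have_plan) cufftDestroy(plan);
    cudaFree(d_data);
    return ok;
}
```

If your own kernels produce the input, they would write it to a global buffer like `d_data` and the CUFFT call would follow as a separate step, rather than being invoked from inside the kernel on shared memory.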
Even if the CUFFT routines were supplied as device functions (and were thus callable from a kernel), they would introduce significant register pressure on top of the user code, which in turn would probably hurt rather than help performance. Still, the issue is worth investigating.
Thanks for the question!