cufftExecR2C calls and performance

I’ve been porting some “old” (I say old, but it’s only been a year!) GPU code to use streams and started profiling the application with the visual profiler (nvvp). I noticed that cufftExecR2C makes a call to cudaBindTexture, then a bunch of calls to cudaSetupArgument, then the launch, and finally a call to cudaUnbindTexture. What do the texture calls have to do with executing an FFT? It seems like a lot of overhead goes into performing this. Are these texture calls normal? Are there any “best practices” out there for performing FFTs on the GPU? I’m obviously trying to milk as much out of this as I can, and the cufftExecR2C call accounts for the largest share of execution time in our application.

Some more background for those interested. The application reads data in and processes FFT_SIZE data points per loop iteration. The very first step in each iteration is the cufftExecR2C call, followed by computing the PSD and some other data. The PSD kernel and the second kernel run relatively quickly; the FFT setup and teardown seem to take longer than those two kernels combined.
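To make the flow concrete, here is a stripped-down sketch of the loop. It is not the actual application code: the names psd_kernel and process_frames, the 1/N PSD scaling, the single stream, and the minimal error handling are all assumptions made up for illustration, and the second kernel is left out. The plan is created once and attached to the stream with cufftSetStream before the loop starts.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical PSD kernel: squared magnitude of each bin times a scale factor.
// The 1/N scaling passed in below is an assumed convention, not the real one.
__global__ void psd_kernel(const cufftComplex* spec, float* psd, int bins, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < bins) {
        psd[i] = (spec[i].x * spec[i].x + spec[i].y * spec[i].y) * scale;
    }
}

// Hypothetical driver: processes num_frames frames of fft_size real samples.
// h_in holds num_frames * fft_size floats, h_psd receives
// num_frames * (fft_size / 2 + 1) floats. Returns 0 on success, -1 on error.
int process_frames(const float* h_in, float* h_psd, int fft_size, int num_frames)
{
    const int bins = fft_size / 2 + 1;  // length of the R2C output
    cufftReal* d_in = nullptr;
    cufftComplex* d_spec = nullptr;
    float* d_psd = nullptr;
    cudaStream_t stream = nullptr;
    cufftHandle plan = 0;
    bool have_plan = false;

    bool ok = cudaStreamCreate(&stream) == cudaSuccess
           && cudaMalloc((void**)&d_in, sizeof(cufftReal) * fft_size) == cudaSuccess
           && cudaMalloc((void**)&d_spec, sizeof(cufftComplex) * bins) == cudaSuccess
           && cudaMalloc((void**)&d_psd, sizeof(float) * bins) == cudaSuccess;

    // Create the plan once, outside the loop, and bind it to the stream.
    if (ok) {
        have_plan = cufftPlan1d(&plan, fft_size, CUFFT_R2C, 1) == CUFFT_SUCCESS;
        ok = have_plan && cufftSetStream(plan, stream) == CUFFT_SUCCESS;
    }

    const int threads = 256;
    const int blocks = (bins + threads - 1) / threads;

    for (int f = 0; ok && f < num_frames; ++f) {
        // Upload one frame, then run the FFT on the same stream.
        ok = cudaMemcpyAsync(d_in, h_in + (size_t)f * fft_size,
                             sizeof(cufftReal) * fft_size,
                             cudaMemcpyHostToDevice, stream) == cudaSuccess
          && cufftExecR2C(plan, d_in, d_spec) == CUFFT_SUCCESS;
        if (!ok) break;

        // PSD kernel follows the FFT in stream order.
        psd_kernel<<<blocks, threads, 0, stream>>>(d_spec, d_psd, bins, 1.0f / fft_size);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpyAsync(h_psd + (size_t)f * bins, d_psd,
                             sizeof(float) * bins,
                             cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    }

    if (ok) ok = cudaStreamSynchronize(stream) == cudaSuccess;

    // Release everything that was successfully created.
    if (have_plan) cufftDestroy(plan);
    cudaFree(d_psd);
    cudaFree(d_spec);
    cudaFree(d_in);
    if (stream) cudaStreamDestroy(stream);
    return ok ? 0 : -1;
}
```

Built with something like nvcc -lcufft, calling process_frames(h_in, h_psd, FFT_SIZE, num_frames) corresponds to the per-loop work described above. In the profiler, the texture bind/unbind and argument setup calls show up around the cufftExecR2C line.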

Thanks in advance for any thoughts.