I want to use the cuFFT library, but I don’t want the overhead of launching another kernel.
Is it possible to adapt the library call into a device function so that I can just call it from an already launched function?
That is not possible with CUFFT. CUBLAS has support for this.
So is there a common approach for people who want to call these library kernels without incurring a lot of kernel launch overhead?
Common strategies for improving CUFFT efficiency:
- batching of transforms
- using the CUFFT API to manage temporary allocations (“workspace”) yourself
- reuse of plans