Using a CUDA library call as a device function instead of a kernel launch

I want to use the cuFFT library, but I don’t want the overhead of launching another kernel.
Is it possible to adapt the library call into a device function, so that I can call it from a kernel that has already been launched?

This is not possible with cuFFT. cuBLAS does have support for this; see the sketch below.
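
For reference, a minimal sketch of what that looks like, assuming the legacy cuBLAS device API (the cublas_device library, which requires compute capability 3.5+, relocatable device code, and was removed in CUDA 10.0). The kernel name, sizes, and build line here are illustrative:

    // Build (illustrative): nvcc -arch=sm_35 -rdc=true gemm_device.cu \
    //                            -lcublas_device -lcudadevrt
    #include <cublas_v2.h>

    __global__ void gemmFromDevice(int n, const float* A, const float* B,
                                   float* C)
    {
        // One thread issues the call; internally the library still launches
        // child kernels via dynamic parallelism, so this removes host-side
        // launch overhead rather than all launch overhead.
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            cublasHandle_t handle;
            cublasCreate(&handle);
            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, A, n, B, n, &beta, C, n);
            cublasDestroy(handle);
        }
    }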

So is there a common approach for cases like this, where you want to call these library kernels but avoid a lot of kernel launch overhead?

Common strategies for improving cuFFT efficiency (a combined sketch follows the list):

  • batching transforms, so one plan and one execution cover many FFTs
  • using the cuFFT API to manage the temporary allocations ("workspace") yourself
  • reusing plans instead of re-creating them for every transform
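
All three can be combined. Below is a minimal sketch (the transform length, batch size, and iteration count are assumed for illustration): a single plan covers a whole batch of 1D transforms via the batch parameter of cufftMakePlan1d, auto-allocation is disabled so the caller supplies the workspace, and the same plan is reused across many executions so the planning cost is paid once:

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main()
    {
        const int n = 1024;    // transform length (assumed for illustration)
        const int batch = 64;  // transforms executed per call

        // Create the plan handle, and take over workspace management before
        // planning so cuFFT does not allocate its own temporary buffer.
        cufftHandle plan;
        cufftCreate(&plan);
        cufftSetAutoAllocation(plan, 0);

        size_t wsSize = 0;
        cufftMakePlan1d(plan, n, CUFFT_C2C, batch, &wsSize);

        void* workArea = nullptr;
        cudaMalloc(&workArea, wsSize);
        cufftSetWorkArea(plan, workArea);

        cufftComplex* data;
        cudaMalloc(&data, sizeof(cufftComplex) * n * batch);
        cudaMemset(data, 0, sizeof(cufftComplex) * n * batch);

        // Reuse the same plan (and workspace) across iterations: planning
        // cost is paid once, and each call covers the whole batch.
        for (int iter = 0; iter < 100; ++iter) {
            cufftExecC2C(plan, data, data, CUFFT_FORWARD);
        }
        cudaDeviceSynchronize();

        cudaFree(data);
        cudaFree(workArea);
        cufftDestroy(plan);
        return 0;
    }

Error checking is omitted for brevity; in real code every cufft* and cuda* call returns a status worth checking.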