cuFFT Device-callable Library

Reading the information on CUDA 5 and the new K20s, I saw that cuBLAS can now be called from device code, along with mention of other libraries being converted in future. Is there any timeframe for when cuFFT will be ported? (I assume it isn’t already enabled; not having a K20, I cannot check.) The ability to run FFTs from device code is likely to be the main selling point for us getting a K20.

I’d also ask when the next GeForce will be released, but I am pretty sure you are unable to talk about future cards :)

Hi Tiomat,

We haven’t announced when the GPU-callable version of cuFFT will be released, but we are working on it now.

Members of the CUDA Registered Developer Program will get early access to the pre-release builds. If you’re not already a member, you can sign up at

In what kinds of application(s) do you want to use GPU accelerated FFTs?

I am working on a product that uses an advanced imaging technique called ptychography, which requires large numbers of iterations of an algorithm to produce high-resolution images with quantitative phase values. The algorithm is inherently iterative, so there is no obvious way to ‘batch’ the FFTs together in any meaningful way. The method contains quite a few kernels separated by FFTs (43 kernels and 25 FFTs per iteration in the worst case). When you are talking about hundreds of thousands of iterations, the overhead of passing control back to the CPU becomes a factor worth considering. As it is an imaging technique there are performance requirements, and the documentation on Dynamic Parallelism made this look like a nice, simple way of removing the kernel-launch overhead.

Although we are currently in the development rather than the optimisation stage, obvious and easily attained speed gains are something we are continually looking for.
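To give a feel for the structure, the per-iteration pattern looks roughly like the following host-driven loop. This is a simplified sketch with hypothetical kernel names (`update_estimate`, `apply_constraint`), not the actual product code; the point is that every `cufftExecC2C` and kernel launch is a separate host-side call, which is where the launch overhead accumulates:

```cpp
#include <cufft.h>

// Sketch of the iterative pattern described above (hypothetical kernels).
void run_iterations(cufftComplex* d_wave, int nx, int ny, int iterations) {
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);  // plan created once, up front

    dim3 grid(nx / 16, ny / 16), block(16, 16);
    for (int it = 0; it < iterations; ++it) {
        update_estimate<<<grid, block>>>(d_wave);           // hypothetical kernel
        cufftExecC2C(plan, d_wave, d_wave, CUFFT_FORWARD);  // host-side call
        apply_constraint<<<grid, block>>>(d_wave);          // hypothetical kernel
        cufftExecC2C(plan, d_wave, d_wave, CUFFT_INVERSE);
        // Each launch above pays the host launch overhead; a device-callable
        // cuFFT would let a single kernel drive the whole loop on the GPU.
    }
    cufftDestroy(plan);
}
```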


How about implementing a device function using some standard FFT algorithm, for example the one from Numerical Recipes? Would that work for your problem?
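For illustration, a hand-rolled iterative radix-2 Cooley–Tukey FFT along those lines might look like the sketch below. This is plain C++ for clarity; on the GPU it could be marked `__device__` and use `cuComplex` instead of `std::complex` (that adaptation is an assumption, and this is not NVIDIA library code):

```cpp
#include <complex>
#include <cmath>
#include <cstddef>
#include <utility>

// In-place iterative radix-2 Cooley-Tukey FFT; n must be a power of two.
// On the device this could be marked __device__ (hypothetical adaptation).
void fft_radix2(std::complex<double>* a, std::size_t n) {
    const double PI = 3.14159265358979323846;
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly stages of increasing length.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        double ang = -2.0 * PI / static_cast<double>(len);
        std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                std::complex<double> u = a[i + k];
                std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

A naive single-thread version like this forgoes the shared-memory staging and multi-thread butterflies a tuned library would use, which is exactly the efficiency gap mentioned in the reply below.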

I had thought about hand-cranking an FFT implementation but avoided it for a few reasons. Primarily, NVIDIA-created device-callable libraries were on the horizon, and they are likely to be more efficient than anything I would end up writing. Plus, from a lazy software engineer’s point of view, having less code to maintain is always good.

Hi, is there any progress on this issue? I wonder whether the device-callable cuFFT library is available yet. If so, where can I find it? Or if there is a preliminary version, please let me know. Thank you very much.

Shameless bump to ask whether there is any news on device-callable cuFFT. With the rumour mill at full tilt regarding the Titan, knowing whether device-callable cuFFT will be available soon would let me get some budget set aside for R&D.


I’d like to know what you mean when you write about device-callable cuFFT:
a) plan creation and plan execution on the device, or
b) plan execution only?

If you could describe your expectations it would help a lot.

Plan execution on the GPU would be enough, as the plan creation can be pretty well defined beforehand. Whether it will make a massive difference to my performance I don’t know, but it might help keep the code structured in a nicer way.

I am doing a quick bump of this, as I am still very interested in whether a device-callable cuFFT library will be available soon. Currently, Dynamic Parallelism looks to be the best way of gaining a performance improvement: WDDM appears to be crippling me, as the time to launch my kernels exceeds the individual kernel execution times, leading to big gaps between the blocks of kernel executions. Unfortunately, since the algorithm relies on multiple FFTs, this approach is not easily attainable.
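One way to put a number on the launch overhead being described is to time many launches of an empty kernel with CUDA events; something along these lines (a rough sketch, measuring launch plus execution of a do-nothing kernel) should show a clear difference between WDDM and TCC driver modes:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int N = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();   // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i) empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch+execute time: %f us\n", 1000.0f * ms / N);
    return 0;
}
```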

Shameless bump again: any news on this from an NVIDIA employee? It does seem at times that very little information comes out via the forums.

Unlike cuBLAS, a device-callable cuFFT is more complex to implement, because plan creation will likely still have to execute on the host. We are making changes to our internal implementation to allow this functionality in the future, but we have not announced a specific release in which it will be available.

If WDDM is your main obstacle, and you are using a Tesla card (not GeForce), then you could also try the Windows TCC driver mode, which reduces the latency to launch CUDA kernels:

Shameless bump once again. Has there been any progress on this, or is it likely to be a post-Maxwell release?

Also, would it be possible for someone to give me some rough numbers on kernel launch times under WDDM for a Tesla, so I can decide how hard to push to get hold of a K20?