cuFFT Device-callable Library

Tiomat · November 5, 2012, 9:22am

The information on CUDA 5 and the new K20s mentions that cuBLAS can be called from device code, and that other libraries will be converted in future. Is there any timeframe for when cuFFT will be ported (assuming it isn’t already enabled; not having a K20, I cannot check)? The ability to run FFTs from device code is likely to be the main selling point for us getting a K20.

I’d also ask when the next GeForce will be released, but I am pretty sure you are unable to talk about future cards :)

Will_Ramey · November 6, 2012, 8:13pm

Hi Tiomat,

We haven’t announced when the GPU-callable version of cuFFT will be released, but we are working on it now.

Members of the CUDA Registered Developer Program will get early access to the pre-release builds. If you’re not already a member, you can sign up on the NVIDIA Developer Program page.

In what kinds of application(s) do you want to use GPU accelerated FFTs?

Tiomat · November 7, 2012, 10:48am

I am working on a product which uses an advanced imaging technique called ptychography, which requires large numbers of iterations of an algorithm to produce high-resolution images with quantitative phase values. The algorithm is inherently iterative, so there is no obvious way to ‘batch’ the FFTs together in any meaningful way. Each iteration contains quite a few kernels separated by FFTs (43 kernels and 25 FFTs, worst case). When you are talking about hundreds of thousands of iterations, the overhead of passing control back to the CPU becomes a factor that is worth considering. As it is an imaging technique there are performance requirements, and from the Dynamic Parallelism documentation it looked like a nice, simple way of removing the kernel launch overhead.

Although we are currently in development rather than the optimisation stage, obvious and easy to attain speed gains are something we are continually looking for.

Thanks,
Tiomat

pasoleatis · January 12, 2013, 12:42pm

How about implementing a device function using some standard FFT algorithm, for example the one from Numerical Recipes? Would that work for your problem?
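
For reference, a textbook-style FFT of this kind might look roughly like the sketch below. It is a generic in-place radix-2 Cooley-Tukey transform, not the Numerical Recipes code and not anything from cuFFT. The `Complex` struct, the `HD` macro and the function name `fft_radix2` are all made up for illustration. It is written as plain C++ so it can be checked on the host; the `HD` macro only indicates how it might be marked for device use, and that path is untested here.

```cpp
#include <cmath>

// Assumption: when compiled with nvcc, mark the function callable from
// both host and device; in plain C++ the macro expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Minimal complex type (illustrative; avoids std::complex).
struct Complex { double re, im; };

// In-place iterative radix-2 FFT, executed serially by the caller.
//   data : n complex samples, overwritten with the transform
//   n    : transform length, must be a power of two
//   sign : -1 for the forward transform, +1 for the unnormalised inverse
HD inline void fft_radix2(Complex* data, unsigned n, int sign)
{
    // Step 1: reorder the input into bit-reversed index order.
    for (unsigned i = 1, j = 0; i < n; ++i) {
        unsigned bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { Complex t = data[i]; data[i] = data[j]; data[j] = t; }
    }

    // Step 2: butterfly passes over sub-transforms of length 2, 4, ..., n.
    const double pi = 3.14159265358979323846;
    for (unsigned len = 2; len <= n; len <<= 1) {
        const double ang = sign * 2.0 * pi / len;
        const double wr = std::cos(ang), wi = std::sin(ang);
        for (unsigned start = 0; start < n; start += len) {
            double cr = 1.0, ci = 0.0;          // current twiddle factor
            for (unsigned k = 0; k < len / 2; ++k) {
                const Complex u = data[start + k];
                const Complex v = data[start + k + len / 2];
                const double tr = v.re * cr - v.im * ci;   // v * twiddle
                const double ti = v.re * ci + v.im * cr;
                data[start + k]           = {u.re + tr, u.im + ti};
                data[start + k + len / 2] = {u.re - tr, u.im - ti};
                const double nr = cr * wr - ci * wi;       // advance twiddle
                ci = cr * wi + ci * wr;
                cr = nr;
            }
        }
    }
}
```

Calling `fft_radix2(x, n, -1)` followed by `fft_radix2(x, n, +1)` and dividing every element by `n` should recover the original samples. Note that this runs the whole transform in a single thread, so for image-sized FFTs it would very likely be far slower than a tuned library; a competitive device-side version would need to spread the butterflies across threads.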

Tiomat · January 14, 2013, 9:28am

I had thought about hand-cranking an FFT implementation but avoided it for a few reasons. Primarily, there were NVIDIA-created device-callable libraries on the horizon, and they are likely to be more efficient than what I would end up writing. Plus, from a lazy software engineer’s point of view, having less code to maintain is always good.

zhangdw · January 23, 2013, 2:34am

Hi, is there any progress on this issue? I wonder whether the device-callable cuFFT library is available yet. If so, where can I find it? Or if there is a preliminary version, please also let me know. Thank you very much.

Tiomat · February 14, 2013, 11:45am

Shameless bump to ask whether there is any news on device-callable cuFFT. With the rumour mill at full tilt with regards to the Titan, knowing whether device-callable cuFFT will be available soon would give me the ability to get some budget set aside for R&D.

lligowski · February 19, 2013, 11:48pm

Hi,

I’d like to know what you mean by device-callable cuFFT:
a) plan creation and plan execution on device
b) plan execution

If you could describe your expectations it would help a lot.

Tiomat · February 22, 2013, 9:20am

Plan execution on the GPU would be enough, as the plan creation can be pretty well defined beforehand. Whether it will make a massive difference to my performance I don’t know, but it might help keep the code structured in a nicer way.

Tiomat · June 27, 2013, 11:07am

I am doing a quick bump of this as I am still very interested in whether a device-callable cuFFT library will be available soon. Currently dynamic parallelism looks to be the best way of gaining a performance improvement (WDDM looks to be crippling me: the time to launch the kernels is more than my individual kernel execution times, leading to big gaps between the blocks of kernel executions). Unfortunately, since the algorithm relies on multiple FFTs, this is not easily attainable.

Tiomat · September 3, 2013, 9:14am

Shameless bump again, any news on this from an Nvidia employee? It does seem at times that very little information comes out via the forums.
Tiomat

ukapasi · September 23, 2013, 8:05pm

Unlike cuBLAS, the implementation of a device-callable cuFFT is more complex, because plan creation will likely still execute on the host. We are making changes in our internal implementation to allow this functionality in the future, but we have not announced a specific release in which it will be available.

If WDDM is your main obstacle, and you are using a Tesla card (not GeForce), then you could also try the Windows TCC driver mode, which reduces the latency to launch CUDA kernels:

CUDA Toolkit Documentation

Tiomat · November 21, 2013, 10:27am

Shameless bump once again. Has there been any progress on this, or is it likely to be a post-Maxwell release?

Also, would it be possible for someone to give me some rough numbers on launch times using WDDM for a Tesla, so I can decide how hard to push to get hold of a K20?

Thanks,
Tiomat
