What specifically is deprecated about cuFFT callbacks in CUDA 11.4?

The release notes for CUDA 11.4 state:

Support for callback functionality using separately compiled device code is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.

It’s unclear what this means exactly. I have used callback functionality since it was introduced to cuFFT, and my understanding was that it has always required separate compilation, because using callbacks requires linking against the cuFFT static library, and linking with the static library requires using separate compilation, as stated in the cuFFT documentation here:

Whereas to compile against the static cuFFT library, extra steps need to be taken. The library needs to be device linked. It may happen during building and linking of a simple program, or as a separate step. The entire process is described in Using Separate Compilation in CUDA.

and

The cuFFT static library supports user supplied callback routines. The callback routines are CUDA device code, and must be separately compiled with NVCC and linked with the cuFFT library. Please refer to the NVCC documentation regarding separate compilation for details. If you specify an SM when compiling your callback functions, you must specify one of the SM’s cuFFT includes.
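For context, the separate-compilation workflow those quotes refer to looks roughly like this (a sketch only; the file names and the `-arch` value are assumptions, and the static cuFFT library also requires linking `culibos`):

```shell
# Compile with relocatable device code (-dc) so the callback
# device function can be resolved at device-link time.
nvcc -dc -arch=sm_70 -o app.o app.cu

# Link against the static cuFFT library; culibos is its dependency.
# nvcc performs the device-link step implicitly here.
nvcc -arch=sm_70 -o app app.o -lcufft_static -lculibos
```

The device-link step can also be done explicitly with `nvcc -dlink` as a separate stage in larger builds.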

Can someone clarify what specifically has been deprecated, and what the prescribed method is for compiling/linking with cuFFT when using callback functionality going forward?

We are revamping callbacks to add flexibility and performance. We are not expecting many, if any, changes to legacy code. We will have more details in the future, closer to release.

Hi jasonriek5l,

What hardware are you currently using with callbacks?

Our software is deployed with many different GPUs, so I might be missing some, but off the top of my head:

  • GTX 1080
  • Quadro P2000
  • Titan X (Pascal)
  • Titan V
  • V100/V100S
  • Titan RTX
  • Quadro RTX 4000
  • Quadro RTX 5000
  • A100
  • Jetson AGX Xavier

Basically, we are using models from every generation since Pascal at this point.

Are there any updates on this?

Or on the known issue from the CUDA 11.8 release notes:

  • Performance of cuFFT callback functionality was changed across all plan types and FFT sizes. Performance of a small set of cases regressed up to 0.5x, while most of the cases didn’t change performance significantly, or improved up to 2x. In addition to these performance changes, using cuFFT callbacks for loading data in out-of-place transforms might exhibit performance and memory footprint overhead for all cuFFT plan types and FFT sizes. An upcoming release will update the cuFFT callback implementation, removing the overheads and performance drops. cuFFT deprecated callback functionality based on separately compiled device code in CUDA 11.4.

Looking for more information on this topic.

We have a forward cufftExecC2C that uses an output callback function. The callback evaluates the magnitude of each element and only writes to dataOut when the magnitude is above a threshold. The output memory range is zeroed with cudaMemsetAsync between iterations, so all nonzero values copied to the output by the callback should exceed the threshold.
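For reference, a store callback like the one described above looks roughly like this (a sketch, assuming a single-precision C2C transform; `THRESHOLD` and the function names are placeholders, not the poster's actual code):

```cuda
#include <cufft.h>
#include <cufftXt.h>

#define THRESHOLD 1.0e-3f  // placeholder threshold value

// Store callback: write the element only when its magnitude exceeds the
// threshold. Below-threshold locations are left untouched, relying on the
// output buffer having been zeroed beforehand with cudaMemsetAsync.
__device__ void storeAboveThreshold(void *dataOut, size_t offset,
                                    cufftComplex element,
                                    void *callerInfo, void *sharedPointer)
{
    float mag = sqrtf(element.x * element.x + element.y * element.y);
    if (mag > THRESHOLD) {
        ((cufftComplex *)dataOut)[offset] = element;
    }
}

// Device-side function pointer; the host copies it out with
// cudaMemcpyFromSymbol and registers it via cufftXtSetCallback
// with type CUFFT_CB_ST_COMPLEX.
__device__ cufftCallbackStoreC d_storeCallbackPtr = storeAboveThreshold;
```

Note that this pattern deliberately leaves some output locations unwritten, which is relevant to the behavior discussed below.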

Our callback works properly when built with CUDA Toolkit 11.4 and before. It does not work for CUDA Toolkit 11.5 and later. In the newer CUDA versions we get values below threshold in the output. They aren’t being written by our callback, so it isn’t clear where they’re coming from.

We noted the 11.4 deprecation that started this thread. Since our code works in 11.4, it seemed like the “separately compiled” deprecation wasn’t the issue, but perhaps I’m misunderstanding and 11.4 is the last version in which it works.

Does anyone understand what changed in cuFFT or callbacks that has caused data to leak into the output, bypassing our callback function? Even if we rewrite the callback so that no element ever passes the threshold, or so that it never writes to the output at all, we still get data where there should be zeros. Does anyone have experience like this, or a solution to get the callback or cuFFT to work as it did in version 11.4 and prior?

We have combined all CUDA source code into a monolithic .cu file to remove the “separately compiled device code” issue. This didn’t help, but perhaps we’re not understanding what “separately compiled” means. We have some C++ libraries that are linked into the executable, but the CUDA code was all built at once.

CUFFT may launch multiple kernels/steps under the hood. The output data area may be used as a temporary data storage area, before the final results are written there in the final step.

If your callback selectively copies/writes some locations but not others, then it’s possible that the temporary data is “leaking” this way. A cudaMemsetAsync operation that you issue is not likely to be inserted at the proper place in the process, i.e. after all temporary steps but before the final step that invokes the callback.

I’m stating this based on my observation of CUFFT behavior and the description you have provided. I’m not certain it applies to your case without an example to inspect.

The CUFFT designers may choose to revamp algorithm implementation details such as what I have discussed - number of kernels/steps, use of temporary storage, etc - from time to time. I don’t have specific knowledge that such a revamp occurred or did not occur in the 11.4->11.5 timeframe.

Perhaps you could rewrite your callback to write to all output locations: the transformed value where the magnitude is above the threshold, and zero where it is below. The thread I linked may give you some other ideas.
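That suggestion amounts to having the callback unconditionally write every location it is handed. A minimal sketch, again assuming a single-precision C2C transform with a placeholder `THRESHOLD`:

```cuda
#include <cufft.h>
#include <cufftXt.h>

#define THRESHOLD 1.0e-3f  // placeholder threshold value

// Store callback that writes every output location: the transformed value
// when its magnitude is above the threshold, and zero otherwise. Because
// every location is overwritten in the final step, no intermediate data
// left behind by earlier cuFFT kernels can survive in the output.
__device__ void storeThresholdOrZero(void *dataOut, size_t offset,
                                     cufftComplex element,
                                     void *callerInfo, void *sharedPointer)
{
    float mag = sqrtf(element.x * element.x + element.y * element.y);
    cufftComplex zero = {0.0f, 0.0f};
    ((cufftComplex *)dataOut)[offset] = (mag > THRESHOLD) ? element : zero;
}

__device__ cufftCallbackStoreC d_storeCallbackPtr = storeThresholdOrZero;
```

A side benefit of this version is that the separate cudaMemsetAsync zeroing pass between iterations would no longer be needed, since the callback itself writes zeros to the below-threshold locations.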