What specifically is deprecated about cuFFT callbacks in CUDA 11.4?

The release notes for CUDA 11.4 state:

Support for callback functionality using separately compiled device code is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.

It’s unclear what this means exactly. I have used callback functionality since it was introduced to cuFFT, and my understanding was that it has always required separate compilation, because using callbacks requires linking against the cuFFT static library, and linking with the static library requires using separate compilation, as stated in the cuFFT documentation here:

Whereas to compile against the static cuFFT library, extra steps need to be taken. The library needs to be device linked. It may happen during building and linking of a simple program, or as a separate step. The entire process is described in Using Separate Compilation in CUDA.

and

The cuFFT static library supports user supplied callback routines. The callback routines are CUDA device code, and must be separately compiled with NVCC and linked with the cuFFT library. Please refer to the NVCC documentation regarding separate compilation for details. If you specify an SM when compiling your callback functions, you must specify one of the SM’s cuFFT includes.
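For context, the separate-compilation workflow those quotes refer to looks roughly like this (a sketch only; the file names and the `-arch` value are assumptions, and the static cuFFT library also requires linking `culibos`):

```shell
# Compile with relocatable device code (-dc) so the callback
# device function can be resolved at device-link time.
nvcc -dc -arch=sm_70 -o app.o app.cu

# Link against the static cuFFT library; culibos is its dependency.
# nvcc performs the device-link step implicitly here.
nvcc -arch=sm_70 -o app app.o -lcufft_static -lculibos
```

The device-link step can also be done explicitly with `nvcc -dlink` as a separate stage in larger builds.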

Can someone clarify what specifically has been deprecated, and what the prescribed method is for compiling/linking with cuFFT when using callback functionality going forward?

We are revamping callbacks to add flexibility and performance. We are not expecting many, if any, changes to legacy code. We will have more details in the future, closer to release.

Hi jasonriek5l,

What hardware are you currently using with callbacks?

Our software is deployed with many different GPUs, so I might be missing some, but off the top of my head:

  • GTX 1080
  • Quadro P2000
  • Titan X (Pascal)
  • Titan V
  • V100/V100S
  • Titan RTX
  • Quadro RTX 4000
  • Quadro RTX 5000
  • A100
  • Jetson AGX Xavier

Basically, we are using models from every generation since Pascal at this point.

Are there any updates on this?

Or on the known issue from the CUDA 11.8 release notes:

  • Performance of cuFFT callback functionality was changed across all plan types and FFT sizes. Performance of a small set of cases regressed up to 0.5x, while most of the cases didn’t change performance significantly, or improved up to 2x. In addition to these performance changes, using cuFFT callbacks for loading data in out-of-place transforms might exhibit performance and memory footprint overhead for all cuFFT plan types and FFT sizes. An upcoming release will update the cuFFT callback implementation, removing the overheads and performance drops. cuFFT deprecated callback functionality based on separately compiled device code in CUDA 11.4.

Looking for more information on this topic.

We have a forward cufftExecC2C that uses an output callback function. The callback evaluates the magnitude of each element and only writes to dataOut when the magnitude is above a threshold. The output memory range is zeroed with cudaMemsetAsync between iterations, so all nonzero values copied to the output by the callback should exceed the threshold.
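For reference, a store callback like the one described above looks roughly like this (a sketch, assuming a single-precision C2C transform; `THRESHOLD` and the function names are placeholders, not the poster's actual code):

```cuda
#include <cufft.h>
#include <cufftXt.h>

#define THRESHOLD 1.0e-3f  // placeholder threshold value

// Store callback: write the element only when its magnitude exceeds the
// threshold. Below-threshold locations are left untouched, relying on the
// output buffer having been zeroed beforehand with cudaMemsetAsync.
__device__ void storeAboveThreshold(void *dataOut, size_t offset,
                                    cufftComplex element,
                                    void *callerInfo, void *sharedPointer)
{
    float mag = sqrtf(element.x * element.x + element.y * element.y);
    if (mag > THRESHOLD) {
        ((cufftComplex *)dataOut)[offset] = element;
    }
}

// Device-side function pointer; the host copies it out with
// cudaMemcpyFromSymbol and registers it via cufftXtSetCallback
// with type CUFFT_CB_ST_COMPLEX.
__device__ cufftCallbackStoreC d_storeCallbackPtr = storeAboveThreshold;
```

Note that this pattern deliberately leaves some output locations unwritten, which is relevant to the behavior discussed below.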

Our callback works properly when built with CUDA Toolkit 11.4 and before. It does not work for CUDA Toolkit 11.5 and later. In the newer CUDA versions we get values below threshold in the output. They aren’t being written by our callback, so it isn’t clear where they’re coming from.

We noted the 11.4 deprecation that started this thread. Since our code works in 11.4, it seemed like the “separately compiled” deprecation wasn’t the issue, but perhaps I’m misunderstanding and 11.4 is the last version in which it works.

Does anyone understand what changed in cuFFT or callbacks that has caused data to leak into the output, bypassing our callback function? Even if we rewrite the callback so that no element ever passes the threshold, or so that it never writes to the output at all, we still get data where there should be zeros. Does anyone have experience like this, or a solution to get the callback or cuFFT to work as it did in version 11.4 and prior?

We have combined all CUDA source code into a monolithic .cu file to remove the “separately compiled device code” issue. This didn’t help, but perhaps we’re not understanding what “separately compiled” means. We have some C++ libraries that are linked into the executable, but the CUDA code was all built at once.

CUFFT may launch multiple kernels/steps under the hood. The output data area may be used as a temporary data storage area, before the final results are written there in the final step.

If your callback selectively copies/writes some locations but not others, then it’s possible that the temporary data is “leaking” this way. A cudaMemsetAsync operation that you issue is not likely to be inserted at the proper place in the process, i.e. after all temporary steps but before the final step that invokes the callback.

I’m stating this based on my observation of CUFFT behavior and the description you have provided. I’m not certain it applies to your case without an example to inspect.

The CUFFT designers may choose to revamp algorithm implementation details such as what I have discussed - number of kernels/steps, use of temporary storage, etc - from time to time. I don’t have specific knowledge that such a revamp occurred or did not occur in the 11.4->11.5 timeframe.

Perhaps you could rewrite your callback to write to all output locations: the transformed value where the magnitude is above the threshold, and zero where it is below. The thread I linked may give you some other ideas.
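That suggestion amounts to having the callback unconditionally write every location it is handed. A minimal sketch, again assuming a single-precision C2C transform with a placeholder `THRESHOLD`:

```cuda
#include <cufft.h>
#include <cufftXt.h>

#define THRESHOLD 1.0e-3f  // placeholder threshold value

// Store callback that writes every output location: the transformed value
// when its magnitude is above the threshold, and zero otherwise. Because
// every location is overwritten in the final step, no intermediate data
// left behind by earlier cuFFT kernels can survive in the output.
__device__ void storeThresholdOrZero(void *dataOut, size_t offset,
                                     cufftComplex element,
                                     void *callerInfo, void *sharedPointer)
{
    float mag = sqrtf(element.x * element.x + element.y * element.y);
    cufftComplex zero = {0.0f, 0.0f};
    ((cufftComplex *)dataOut)[offset] = (mag > THRESHOLD) ? element : zero;
}

__device__ cufftCallbackStoreC d_storeCallbackPtr = storeThresholdOrZero;
```

A side benefit of this version is that the separate cudaMemsetAsync zeroing pass between iterations would no longer be needed, since the callback itself writes zeros to the below-threshold locations.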