CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Originally published at:

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of…

Thanks for posting this. My only question is what kind of operation you're doing with the data by just multiplying the samples? Assuming this is a continuous stream of input data, wouldn't you need to either do overlap-add or overlap-save with adjacent blocks of samples to produce a meaningful output?

In this somewhat simplified example I use the multiplication as a general convolution operation for illustrative purposes. You are right that if we are dealing with a continuous input stream we probably want to do overlap-add or overlap-save between the segments--both of which have the multiplication at its core, however, and mostly differ by the way you split and recombine the signal. For other kinds of input data, such as correlating samples from multiple sources or processing image data, the operations you want to perform may look different.

Hello Christoph,

I want to ask you if the CUFFT callbacks will become part of the CUDA FFT shared library. Linking with the static library is a little problematic, for some of us using CMake. Also, I heard that in CUDA 7.0 the license is not longer necessary. Can you confirm?

Thanks for the great tutorial.

The license is not longer required in CUDA 7.0. Adding callbacks support in the shared library is a lot of work, which we may do in the future, but we would need to justify the effort with user benefits. So can you explain specifically why using the static library is problematic with CMake?

Hello Mark,

I'm not very good at CMake. Still I found it more convenient than writing Makefiles. Using CMake project files to build CUDA code is usually straight forward, but the static library requires separable compilation and that I haven't succeed using CMake.

My point is that if you already have a project that uses the CUFFT shared library, it would be easier to start using CUFFT callbacks if you can avoid trying to figure out why the CMake project file is no longer working.

I wrote a toy example, of what I'm trying to attempt, and I have share the code in github [1]. I can compile and run such example using the command line.

nvcc -arch=sm_35 -rdc=true -c src/
nvcc -arch=sm_35 -dlink -o thrust_fft_example_link.o thrust_fft_example.o -lcudart -lcufft_static
g++ thrust_fft_example.o thrust_fft_example_link.o -L/usr/local/cuda-6.5/lib64 -lcudart -lcufft_static -lculibos

I tried to compile the same example using CMake and I got stuck at the linking step. CMake translate the above three lines into:

/usr/local/cuda-6.5/bin/nvcc /home/workspace/thrust_fft/src/ -dc -o /home/workspace/thrust_fft/build/CMakeFiles/cuda_compile.dir/src/./ -ccbin /usr/bin/gcc-4.6 -m64 -Xcompiler ,\"-g\",\"-L/usr/local/cuda-6.5/lib64\",\"-g\" -gencode=arch=compute_35,code=sm_35 -DNVCC -I/usr/local/cuda-6.5/include

/usr/local/cuda-6.5/bin/nvcc -gencode=arch=compute_35,code=sm_35 -m64 -ccbin "/usr/bin/gcc-4.6" -dlink /home/workspace/thrust_fft/build/CMakeFiles/cuda_compile.dir/src/./ -o /home/workspace/thrust_fft/build/CMakeFiles/thrust_fft.dir/./thrust_fft_intermediate_link.o

/usr/bin/c++ -g -L/usr/local/cuda-6.5/lib64 CMakeFiles/cuda_compile.dir/src/ CMakeFiles/thrust_fft.dir/thrust_fft_intermediate_link.o -o thrust_fft -rdynamic /usr/local/cuda-6.5/lib64/ /usr/local/cuda-6.5/lib64/libcufft_static.a /usr/local/cuda-6.5/lib64/libculibos.a -Wl,-rpath,/usr/local/cuda-6.5/lib64:

Those extra compiler flags "-g -L/usr/local/cuda-6.5/lib64" are there because the FinCUDA.cmake needs the variable CMAKE_CXX_FLAGS to be defined in order to work. That's also where I got stuck. Hopefully you guys can spot what I'm doing wrong.

What is culibos? What happens when a CUFFT plan is employed to compute DIRECT and INVERSE FFT? Can you selectively tell when you want the callbacks to be used?

The last question: What is the advantage of using CUFFT callbacks vs. Thrust callbacks?

[1] Thrust and CUFFT callback example github repository.

EDIT 1: After posting this message I found that the second line which is automatically generated by CMake is missing the -lcudart and -lcufft_static flags. Adding those manually will make it succeed. The reason why they got removed is still not clear. I will ask in the CMake list. Still I hope you could answer my other questions. Thanks.

I'm curious- is something similar possible with cuBLAS functions? I often need to compute a matrix-matrix multiplication followed by an elementwise operation. The trips to global memory are my main bottleneck.

I tried to implement this example on a GTX 750. I was successful until I added callbacks. I am using Nsight Eclipse to develop as supplied with CUDA 7.0. I tried modifying the generated makefiles to include the compile/link flags above. Comparing the output (I wrote the output to disk and plotted it) to the non-callback version and the callback version leads me to believe that the callback is not being applied.

I also tried compiling with the flags listed in the article. I get errors about SM Arch not found. (Nsight automatically compiles the code for me for my target architecture.)

Is there a way to get this working with Nsight by default?

The trick is to go Project->Properties->Build->Settings>Device linker mode: Separate compilation.

Follow-up question: is --relocatable-device-code=true is equivalent to -dc ?

Almost: "-dc" is equivalent to "--relocatable-device-code true -c" or "-rdc=true -c". Note the "-c" to generate an object file. See the nvcc manual:

I've tried out the source from github on a Jetson TX1. Without
changing anything in the source, I executed the two applications a lot
of times one after the other and I've observed the following: the
no_callback one finishes at ~260-280ms each run but the callback one
performs very differently each time (finishing at random values between
~170-260ms, also occasionally at higher values, max I've seen is

I've also tried running only the callback version one after another, produces the same result. Do you have an idea what might cause this sort of behavior?

For small workload sizes it's not uncommon to see runtime variance. I suggest you run multiple times and average.

Hi Mark,

Thanks for the answer. As I've said in the post I've used exactly the same code from github which does a 1000x1024 operation with 100 iterations. I've also ran *this* code (1000x1024, 100 iter) multiple times (~50 times) and took averages.

No_callback -> in the 260-280ms range (avg ~= 270ms)
Callback -> in the 170-600ms range (avg ~= 350ms)

Do you have an idea what might cause this sort of behavior?


I know this is an old post, but I think this is still an important topic.

I think this example undersells the potential throughput improvements from using callbacks. I believe that the transpose step in the store callback prevents the store operations from being coalesced, greatly reducing potential throughput, especially with larger transform sizes. In my testing, removing the transpose stage (which isn't necessary for many frequency domain filtering applications anyway) resulted in halving the run-time vs the non-callback case. In other words, the callback case is twice as fast.

There's also a mistake in your store callback. You are computing the product of "filter" and "element" and storing that in "value", but then you write out "element" to dataOut instead of the product stored in "value". I haven't looked at the assembly, but I would expect the compiler to optimize out the complex multiply and the read of filter because of that.

Thanks for your comment, Brett. You are right, if you don't require the transpose you can get better coalescing by not transposing, resulting in higher performance. The application that was underlying our experiments expected the data to be transposed, which is why we included it in our code. The exact performance of cuFFT callbacks depends on the CUDA version and GPU you are using and both have changed significantly over the last years. On my V100 I am measuring a ~30%-35% performance increase by not transposing, which is a bit less than the 2x you experienced but still very significant. Thank you for this suggestion!

Regarding the mistake in the store callback: Thanks for the catch. Luckily this is only a bug in the blog post. The code on github does the store in one statement and doesn't use a temporary value variable. We will fix the text.

I know this post is old but I just implemented a small example which turns out to be a lot slower. As a test I am trying to replace the following Kernel:

extern "C"
__global__ void mul_Real(int size, float val, float* outVol)
const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
const unsigned int z = blockIdx.z * blockDim.z + threadIdx.z;

outVol[z * size * size + y * size + x] = outVol[z * size * size + y * size + x] * val;

with the following Callback:

__device__ void CB_TEST(void *dataOut, size_t offset, cufftReal element, void *callerInfo, void *sharedPtr) {
float *filter = (float*)callerInfo;
((cufftReal*)dataOut)[offset] = element*filter[0];

__managed__ cufftCallbackStoreR d_storeCallbackPtr = CB_TEST;

The execution of the FFT is called like this:

cufftPlanMany(&ffthandle, 3, {64,64,64}, NULL, 0, 0, NULL, 0, 0, CUFFT_C2R, 1);

I would expect the that (since I don't have to touch the Real Result twice) I reduce the run time by a small factor but the duration actually gets worse.
FFT+Kernel -> 0.015936ms
FFT+CB -> 0.020512ms

I am thankful for any suggestions.

Thanks for your question, Alexander. cuFFT callbacks use device-side function calls. If the compiler has control over both the device function and the call site, it can often inline __device__ functions, resulting in zero overhead. However, for cuFFT callbacks the compiler does not have control over the call site, which lives inside the cuFFT library. Instead, as described in the blog post, you need to compile your callback as relocatable device code. The lack of inlining can incur some small, but non-zero overhead for each call (see Separate Compilation and Linking of CUDA C++ Device Code). Your callback function is doing relatively little work, only multiplying each element by a constant value, which may just not be enough to compensate for the call overhead. In such cases, running a separate kernel can sometimes result in better overall performance.

For your code, you can try the following two things and check if you see an improvement:
o Pass the filter through a __constant__ or __device__ variable declared outside of the callback function - this would be more efficient than passing through callerInfo.
o __managed__ memory for storing callback functions may introduce some one-time setup cost. When measuring your execution time make sure you run the kernel multiple times and discard the first run. If there is measurable setup cost for __managed__ and that is a problem for your application, you can get rid of __managed__ and do it the traditional way.

If this doesn't work I suggest you keep the separate kernel (and maybe try callbacks again in future cuFFT versions as there are regular improvements to the performance).