Is cuFFT optimized to use Tensor Cores?

lizhihao.cs · October 30, 2019, 7:46pm

I am doing some FFT programming and using cuBLAS’s GEMM to accelerate the algorithm. That raises a question: does cuFFT take advantage of Tensor Cores? If so, I would rather call the cuFFT library directly.

mnicely · September 14, 2020, 3:16pm

No, cuFFT doesn’t currently utilize Tensor Cores.

ahadji05 · May 31, 2021, 12:20pm

Hello, I see this question was posted 11 months ago and I would like to address it again in case there have been any new updates since then!

I recently ran some benchmarks for 1D batched FFTs on a Tesla V100 GPU and obtained at most 2.3 TFLOPS for single-precision complex numbers.
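For readers who want to compare against a figure like this, here is a minimal sketch of how an FFT throughput number is commonly derived from a timing. The post does not say how the flops were counted; the sketch assumes the conventional estimate of 5·N·log2(N) flops per complex transform, and the function name and the sample numbers are hypothetical.

```python
import math

def fft_tflops(n, batch, seconds):
    """Estimated TFLOPS for `batch` complex FFTs of length `n`.

    Assumes the conventional 5 * n * log2(n) flop count per transform;
    this is a bookkeeping convention, not an exact operation count.
    """
    flops = 5.0 * n * math.log2(n) * batch
    return flops / seconds / 1e12

# Hypothetical numbers, not the poster's measurements:
# 100,000 transforms of 4096 points finishing in 10 ms.
print(round(fft_tflops(4096, 100_000, 0.010), 2))  # prints 2.46
```

Because the count is a convention, two benchmarks are only comparable if they use the same formula.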

I used CUDA 11.1.1. Are Tensor Cores used in that version?

mnicely · June 1, 2021, 3:03pm

cuFFT still doesn’t use Tensor Cores.

ahadji05 · June 3, 2021, 9:21am

Okay, thanks for the update!

julien.plante2 · March 16, 2022, 10:30am

Hi,

Sorry to revive this old question, but could you elaborate on why cuFFT doesn’t use Tensor Cores?

I understand that the FFT is generally considered memory-bound, so I guess that the expected gain from using Tensor Cores is small. But is that actually the case? Is there absolutely no case where Tensor Cores can be beneficial when computing a Fourier transform?

I am also wondering if a “brute-force” DFT implementation could be useful, in a similar fashion to implicit GEMM convolution. I might not mind jumping from O(n log n) to O(n²) if this helps reduce the memory bottleneck, and I suppose that this could be a perfect job for Tensor Cores. We might be going outside the scope of cuFFT here, but maybe this could fit into CUTLASS?
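To make the brute-force idea concrete, here is a minimal sketch in plain Python (standard library only) of a batched DFT written as a single matrix product, checked against a small radix-2 FFT. It only illustrates the arithmetic: it is not GPU or Tensor Core code, and the function names are hypothetical. A real implementation would have to map the complex product onto GEMM calls and deal with reduced-precision inputs.

```python
import cmath

def dft_matrix(n):
    """n x n DFT matrix with W[k][j] = exp(-2*pi*i*k*j/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n)]
            for k in range(n)]

def batched_dft(w, batch):
    """Brute-force DFT of every signal in `batch` as one matrix product.

    O(n^2) work per signal, but the computation has the shape of a GEMM:
    (batch x n) times (n x n).
    """
    return [[sum(w_kj * x_j for w_kj, x_j in zip(row, x)) for row in w]
            for x in batch]

def fft(x):
    """Reference recursive radix-2 FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

# Four hypothetical 8-point complex signals.
batch = [[complex(i + j, i - j) for j in range(8)] for i in range(4)]
w = dft_matrix(8)
dft_out = batched_dft(w, batch)

# Largest difference between the matrix-product DFT and the FFT.
max_err = max(abs(a - b)
              for x, y in zip(batch, dft_out)
              for a, b in zip(fft(x), y))
print(max_err < 1e-9)  # prints True
```

The DFT matrix depends only on the transform length, so with a very large batch it is built once and reused for every signal.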

My use case is relatively short FFTs (max 8192 points), but with a huge number of batches (> 100 000).
Thanks for giving any pointers about this.

Julien

mnicely · March 16, 2022, 7:00pm

If you are trying to add custom operations before and after your FFTs, I highly suggest you check out cuFFTDx, available in our MathDx package. It allows you to fuse math operations straight into a CUDA kernel. For FFTs, this removes the memory-bound issue.

You can find examples here and here.

julien.plante2 · March 23, 2022, 7:10am

Hi, thanks a lot for this suggestion.
I currently use cuFFT callbacks, but cuFFTDx definitely looks more flexible; I’ll try it.

Any input on a Tensor Core-powered FFT?

mnicely · March 23, 2022, 1:18pm

julien.plante2: “I understand that the FFT is generally considered as memory-bound, so I guess that the expected gain of using Tensor Cores is not much. But is it actually the case?”

Yes, this is the case.
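A back-of-the-envelope estimate shows why. The sketch below computes flops per byte of memory traffic for the 8192-point case mentioned earlier in the thread. Its assumptions are mine, not from the posts: single-precision complex samples (8 bytes), each sample read once and written once, 5·N·log2(N) flops for the FFT, and 8 real flops per complex multiply-add for the brute-force DFT, with DFT-matrix traffic ignored.

```python
import math

def fft_intensity(n, bytes_per_sample=8):
    """Flops per byte for one complex FFT.

    Assumes 5 * n * log2(n) flops and one read plus one write of every
    sample (complex64 = 8 bytes).
    """
    flops = 5.0 * n * math.log2(n)
    traffic = 2 * n * bytes_per_sample
    return flops / traffic

def dft_intensity(n, bytes_per_sample=8):
    """Same estimate for a brute-force DFT.

    Assumes n * n complex multiply-adds at 8 real flops each; traffic for
    the DFT matrix is ignored on the assumption that it is reused across
    a large batch.
    """
    flops = 8.0 * n * n
    traffic = 2 * n * bytes_per_sample
    return flops / traffic

n = 8192
print(fft_intensity(n))  # prints 4.0625
print(dft_intensity(n))  # prints 4096.0
```

Under these assumptions the FFT performs about 4 flops per byte moved. A GPU with on the order of 15 TFLOPS of single-precision compute and 900 GB/s of memory bandwidth (roughly the V100's published figures) needs about 17 flops per byte before compute becomes the limit, so faster math units alone would not speed up the FFT. The brute-force DFT is far more compute-dense, but only because it does about 1000 times more arithmetic for the same result.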

lionel.matias · November 23, 2022, 9:35pm

Hi, it seems it is possible to use Tensor Cores for half-precision FFTs and run faster than cuFFT:
[2104.11471] tcFFT: Accelerating Half-Precision FFT through Tensor Cores (arxiv.org).