cuFFT FP16 is slower than FP32

Hi everyone,

I am comparing the cuFFT performance of FP32 vs FP16, with the expectation that FP16 throughput should be at least twice that of FP32.

I am aware of the following similar threads on this forum:

  1. 2D-FFT Benchmarks on Jetson AGX with various precisions
    No conclusive action - the issue was closed due to inactivity.

  2. cuFFT 2D on FP16 2D array - #3 by Robert_Crovella
    The OP moved to FP32 because it was faster.

None of the threads provided a conclusive solution to this problem.

I am running this code on a Jetson Xavier NX (compute capability 7.2) with jetson_clocks turned on.

Here is the code. It creates a single batch of test data of shape 2048 x 1024 and performs a 2D FFT on it - which is our use case.
fft32_vs_16.cpp (6.1 KB)
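(The attached file is not reproduced in the thread. For readers, a minimal sketch of what such an FP32-vs-FP16 comparison might look like is below. This is a hypothetical reconstruction, not the OP's actual code; it assumes the `cufftXtMakePlanMany`/`cufftXtExec` path with `CUDA_C_32F`/`CUDA_C_16F` data descriptors, which is how cuFFT exposes half-precision transforms. Note that, per the cuFFT documentation, half-precision transforms require power-of-two sizes, which 2048 x 1024 satisfies.)

```cpp
// Hypothetical sketch: time a 2D C2C FFT in FP32 vs FP16 with cuFFT.
// Build (paths may differ): nvcc fft32_vs_16.cu -lcufft -o fft_benchmark
#include <cufft.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <cstdio>

// Time one in-place forward transform with CUDA events.
static float timeExec(cufftHandle plan, void* data) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftXtExec(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    long long n[2] = {2048, 1024};          // FFT shape (NX, NY)
    const long long count = n[0] * n[1];
    size_t workSize = 0;

    // FP32 plan: single-precision complex in, out, and execution.
    cufftHandle plan32;
    cufftCreate(&plan32);
    cufftXtMakePlanMany(plan32, 2, n, nullptr, 1, 0, CUDA_C_32F,
                        nullptr, 1, 0, CUDA_C_32F, 1, &workSize, CUDA_C_32F);

    // FP16 plan: half-precision complex storage and execution.
    // FP16 requires power-of-two sizes and compute capability >= 5.3.
    cufftHandle plan16;
    cufftCreate(&plan16);
    cufftXtMakePlanMany(plan16, 2, n, nullptr, 1, 0, CUDA_C_16F,
                        nullptr, 1, 0, CUDA_C_16F, 1, &workSize, CUDA_C_16F);

    float2* d32; half2* d16;
    cudaMalloc(&d32, count * sizeof(float2));
    cudaMalloc(&d16, count * sizeof(half2));
    cudaMemset(d32, 0, count * sizeof(float2));
    cudaMemset(d16, 0, count * sizeof(half2));

    for (int i = 0; i < 10; ++i)
        printf("fp32 trial #%d, elapsed: %.4f ms\n", i, timeExec(plan32, d32));
    for (int i = 0; i < 10; ++i)
        printf("fp16 trial #%d, elapsed: %.4f ms\n", i, timeExec(plan16, d16));

    cufftDestroy(plan32);
    cufftDestroy(plan16);
    cudaFree(d32);
    cudaFree(d16);
    return 0;
}
```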

This is the output that I am getting:

The code was compiled and run with the following commands:

sudo jetson_clocks 
nvcc fft_32_vs_16.cpp -L/usr/local/cuda-11.4/lib64  -lcudart -lcufft -o fft_benchmark

Please let me know if I am missing something (data layout, cuFFT version, memory type, etc.), because the documentation claims that lower precisions have higher throughput.

Hi

We are checking this issue internally.
We will share more information with you later.

Thanks.

I note that the performance comparison between FP32 and FP16 may depend on the shape of the FFT. I don’t have a Jetson to test on, but on a few other GPUs I’ve tried, I notice that the FP16 path is faster for a shape of (32, 16384). For example, with CUDA 12.0 on a GTX 1660 Ti:

$ ./t32
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.1043 ms
trial #1, elapsed: 0.1049 ms
trial #2, elapsed: 0.1023 ms
trial #3, elapsed: 0.1035 ms
trial #4, elapsed: 0.1031 ms
trial #5, elapsed: 0.1041 ms
trial #6, elapsed: 0.1040 ms
trial #7, elapsed: 0.1037 ms
trial #8, elapsed: 0.1044 ms
trial #9, elapsed: 0.1039 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0715 ms
trial #1, elapsed: 0.0734 ms
trial #2, elapsed: 0.0736 ms
trial #3, elapsed: 0.0725 ms
trial #4, elapsed: 0.0734 ms
trial #5, elapsed: 0.0722 ms
trial #6, elapsed: 0.0738 ms
trial #7, elapsed: 0.0719 ms
trial #8, elapsed: 0.0733 ms
trial #9, elapsed: 0.0719 ms

It’s just an observation. I suggest waiting to see what comes of the bug.

@Robert_Crovella Thank you for your response! I do see a 2x speedup with this shape on an NVIDIA GeForce RTX 3060 GPU; I have yet to test this on a Jetson device. However, my use case involves shapes closer to (16, 1024).

$ ./fft_benchmark
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.0821 ms
trial #1, elapsed: 0.0829 ms
trial #2, elapsed: 0.0829 ms
trial #3, elapsed: 0.0819 ms
trial #4, elapsed: 0.0809 ms
trial #5, elapsed: 0.0819 ms
trial #6, elapsed: 0.0809 ms
trial #7, elapsed: 0.0819 ms
trial #8, elapsed: 0.0819 ms
trial #9, elapsed: 0.0848 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0461 ms
trial #1, elapsed: 0.0481 ms
trial #2, elapsed: 0.0471 ms
trial #3, elapsed: 0.0471 ms
trial #4, elapsed: 0.0471 ms
trial #5, elapsed: 0.0471 ms
trial #6, elapsed: 0.0460 ms
trial #7, elapsed: 0.0481 ms
trial #8, elapsed: 0.0470 ms
trial #9, elapsed: 0.0461 ms

Hi,

Could you try cuFFTDx? For example:
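(The example link is not reproduced in the thread. As a rough, untested sketch of what a cuFFTDx half-precision block FFT looks like - based on the cuFFTDx documentation's operator-based description style, with `SM<720>` chosen for Xavier NX; the exact operators, element layout, and I/O helpers should be checked against the cuFFTDx docs for your version:)

```cpp
// Hypothetical cuFFTDx sketch: a 1024-point FP16 C2C FFT executed per block.
// Requires the MathDx/cuFFTDx headers; compile with nvcc for sm_72.
#include <cufftdx.hpp>
using namespace cufftdx;

// Describe the FFT at compile time by composing operators.
// Note: in half precision, cuFFTDx packs two FFTs per complex<__half2>
// element (implicit batching), so data layout differs from cuFFT's.
using FFT = decltype(Block() + Size<1024>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>()
                     + Precision<__half>() + SM<720>());

__launch_bounds__(FFT::max_threads_per_block)
__global__ void fft_kernel(typename FFT::value_type* data) {
    // Per-thread register storage for this thread's slice of the FFT.
    typename FFT::value_type thread_data[FFT::storage_size];
    extern __shared__ __align__(alignof(float4)) unsigned char shared_mem[];

    // Load this thread's elements from global memory into thread_data
    // (cuFFTDx ships example I/O helpers for this), then execute:
    FFT().execute(thread_data, shared_mem);
    // ...store thread_data back to global memory.
}

// Launch with the layout the FFT description dictates, e.g.:
//   fft_kernel<<<num_ffts, FFT::block_dim, FFT::shared_memory_size>>>(d_data);
```

Because cuFFTDx fuses the FFT into your own kernel, it can avoid the global-memory round trips that dominate small transforms like (16, 1024), which is where the cuFFT FP16 path shows little benefit.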

Currently, cuFFT carries only a limited set of optimizations for FP16.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.