cuFFT FP16 is slower than FP32

Hi everyone,

I am comparing the cuFFT performance of FP32 vs FP16, with the expectation that FP16 throughput should be at least twice that of FP32.

I am aware of the following similar threads on this forum:

  1. 2D-FFT Benchmarks on Jetson AGX with various precisions
    No conclusive action - the issue was closed due to inactivity.

  2. cuFFT 2D on FP16 2D array - #3 by Robert_Crovella
    The OP moved to FP32 because it was faster.

None of the threads provided a conclusive solution to this problem.

I am running this code on a Jetson Xavier NX (compute capability 7.2) with jetson_clocks turned on.

Here is the code. It creates a single batch of test data of shape 2048 x 1024 and performs a 2D FFT on it - which is our use case.
fft32_vs_16.cpp (6.1 KB)
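(The attached file is not reproduced in the thread. For readers, a minimal sketch of what such an FP32-vs-FP16 comparison might look like is below. This is a hypothetical reconstruction, not the OP's actual code; it assumes the `cufftXtMakePlanMany`/`cufftXtExec` path with `CUDA_C_32F`/`CUDA_C_16F` data descriptors, which is how cuFFT exposes half-precision transforms. Note that, per the cuFFT documentation, half-precision transforms require power-of-two sizes, which 2048 x 1024 satisfies.)

```cpp
// Hypothetical sketch: time a 2D C2C FFT in FP32 vs FP16 with cuFFT.
// Build (paths may differ): nvcc fft32_vs_16.cu -lcufft -o fft_benchmark
#include <cufft.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <cstdio>

// Time one in-place forward transform with CUDA events.
static float timeExec(cufftHandle plan, void* data) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cufftXtExec(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    long long n[2] = {2048, 1024};          // FFT shape (NX, NY)
    const long long count = n[0] * n[1];
    size_t workSize = 0;

    // FP32 plan: single-precision complex in, out, and execution.
    cufftHandle plan32;
    cufftCreate(&plan32);
    cufftXtMakePlanMany(plan32, 2, n, nullptr, 1, 0, CUDA_C_32F,
                        nullptr, 1, 0, CUDA_C_32F, 1, &workSize, CUDA_C_32F);

    // FP16 plan: half-precision complex storage and execution.
    // FP16 requires power-of-two sizes and compute capability >= 5.3.
    cufftHandle plan16;
    cufftCreate(&plan16);
    cufftXtMakePlanMany(plan16, 2, n, nullptr, 1, 0, CUDA_C_16F,
                        nullptr, 1, 0, CUDA_C_16F, 1, &workSize, CUDA_C_16F);

    float2* d32; half2* d16;
    cudaMalloc(&d32, count * sizeof(float2));
    cudaMalloc(&d16, count * sizeof(half2));
    cudaMemset(d32, 0, count * sizeof(float2));
    cudaMemset(d16, 0, count * sizeof(half2));

    for (int i = 0; i < 10; ++i)
        printf("fp32 trial #%d, elapsed: %.4f ms\n", i, timeExec(plan32, d32));
    for (int i = 0; i < 10; ++i)
        printf("fp16 trial #%d, elapsed: %.4f ms\n", i, timeExec(plan16, d16));

    cufftDestroy(plan32);
    cufftDestroy(plan16);
    cudaFree(d32);
    cudaFree(d16);
    return 0;
}
```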

This is the output that I am getting:

The code was compiled and run with the following commands:

sudo jetson_clocks 
nvcc fft_32_vs_16.cpp -L/usr/local/cuda-11.4/lib64  -lcudart -lcufft -o fft_benchmark

Please let me know if I am missing something (data layout, cuFFT version, memory type, etc.), because the documentation claims that lower precisions have higher throughput.

Hi

We are checking this issue internally.
We will share more information with you later.

Thanks.

I note that the performance comparison between FP32 and FP16 may depend on the shape of the FFT. I don’t have a Jetson to test on, but on a few other GPUs I’ve tried, I notice that the FP16 path is faster for a shape of (32, 16384). For example, with CUDA 12.0 on a GTX 1660 Ti:

$ ./t32
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.1043 ms
trial #1, elapsed: 0.1049 ms
trial #2, elapsed: 0.1023 ms
trial #3, elapsed: 0.1035 ms
trial #4, elapsed: 0.1031 ms
trial #5, elapsed: 0.1041 ms
trial #6, elapsed: 0.1040 ms
trial #7, elapsed: 0.1037 ms
trial #8, elapsed: 0.1044 ms
trial #9, elapsed: 0.1039 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0715 ms
trial #1, elapsed: 0.0734 ms
trial #2, elapsed: 0.0736 ms
trial #3, elapsed: 0.0725 ms
trial #4, elapsed: 0.0734 ms
trial #5, elapsed: 0.0722 ms
trial #6, elapsed: 0.0738 ms
trial #7, elapsed: 0.0719 ms
trial #8, elapsed: 0.0733 ms
trial #9, elapsed: 0.0719 ms

It’s just an observation. I suggest waiting to see what comes of the bug.

@Robert_Crovella Thank you for your response! I do see a 2x speedup with this shape on an NVIDIA GeForce RTX 3060 GPU; I have yet to test this on a Jetson device. However, my use case involves shapes closer to (16, 1024).

$ ./fft_benchmark
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.0821 ms
trial #1, elapsed: 0.0829 ms
trial #2, elapsed: 0.0829 ms
trial #3, elapsed: 0.0819 ms
trial #4, elapsed: 0.0809 ms
trial #5, elapsed: 0.0819 ms
trial #6, elapsed: 0.0809 ms
trial #7, elapsed: 0.0819 ms
trial #8, elapsed: 0.0819 ms
trial #9, elapsed: 0.0848 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0461 ms
trial #1, elapsed: 0.0481 ms
trial #2, elapsed: 0.0471 ms
trial #3, elapsed: 0.0471 ms
trial #4, elapsed: 0.0471 ms
trial #5, elapsed: 0.0471 ms
trial #6, elapsed: 0.0460 ms
trial #7, elapsed: 0.0481 ms
trial #8, elapsed: 0.0470 ms
trial #9, elapsed: 0.0461 ms

Hi,

Could you try cuFFTDx? For example:
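(The example link is not reproduced in the thread. As a rough, untested sketch of what a cuFFTDx half-precision block FFT looks like - based on the cuFFTDx documentation's operator-based description style, with `SM<720>` chosen for Xavier NX; the exact operators, element layout, and I/O helpers should be checked against the cuFFTDx docs for your version:)

```cpp
// Hypothetical cuFFTDx sketch: a 1024-point FP16 C2C FFT executed per block.
// Requires the MathDx/cuFFTDx headers; compile with nvcc for sm_72.
#include <cufftdx.hpp>
using namespace cufftdx;

// Describe the FFT at compile time by composing operators.
// Note: in half precision, cuFFTDx packs two FFTs per complex<__half2>
// element (implicit batching), so data layout differs from cuFFT's.
using FFT = decltype(Block() + Size<1024>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>()
                     + Precision<__half>() + SM<720>());

__launch_bounds__(FFT::max_threads_per_block)
__global__ void fft_kernel(typename FFT::value_type* data) {
    // Per-thread register storage for this thread's slice of the FFT.
    typename FFT::value_type thread_data[FFT::storage_size];
    extern __shared__ __align__(alignof(float4)) unsigned char shared_mem[];

    // Load this thread's elements from global memory into thread_data
    // (cuFFTDx ships example I/O helpers for this), then execute:
    FFT().execute(thread_data, shared_mem);
    // ...store thread_data back to global memory.
}

// Launch with the layout the FFT description dictates, e.g.:
//   fft_kernel<<<num_ffts, FFT::block_dim, FFT::shared_memory_size>>>(d_data);
```

Because cuFFTDx fuses the FFT into your own kernel, it can avoid the global-memory round trips that dominate small transforms like (16, 1024), which is where the cuFFT FP16 path shows little benefit.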

Currently, cuFFT carries only a limited set of optimizations for FP16.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.