Unexpectedly low performance of cuFFT with half floating point (FP16)

ClaudioC · June 16, 2017, 9:03am

With NVIDIA GPUs that offer full support for half-precision floating point (FP16), I was expecting FFTs to run about 2x faster in FP16 than in single-precision floating point (FP32).

I have run repeatable benchmarks under controlled conditions on an NVIDIA Tesla P100 and a Jetson TX2, and in both cases the 2x performance boost is only available with very small FFT sizes: smaller than 2^10 on the P100 and smaller than 2^13 on the Jetson TX2.

Full results, together with the source code of the benchmark run, are available in this public git repository:

https://bitbucket.org/ccicconetti/mbi_cuda_snippets/src/4d250095f5d7/BenchmarkFp16/?at=master

A direct link to the P100 vs Jetson TX2 results is:

[External image: P100 vs Jetson TX2 benchmark results]

Has anybody stumbled upon similar results when experimenting with FP16?

njuffa · June 16, 2017, 6:19pm

To the best of my knowledge, large FFTs are always limited by memory throughput, not by compute throughput. While using a narrower data type should theoretically result in higher “effective” memory throughput (measured in numbers/second rather than bytes/second), narrower data types could also reduce the efficiency of memory accesses. So while I would not expect a 2x performance increase throughout, there should still be a meaningful incremental performance increase from using narrower data types.
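A rough way to see why a large FFT ends up memory-bound is an idealized roofline estimate. The sketch below is only a back-of-the-envelope model, not a description of how cuFFT works internally. It assumes roughly 5*N*log2(N) floating-point operations per complex transform and that each pass reads and writes the whole signal once. The function name `fft_roofline` and the device figures (10 TFLOP/s FP32, 20 TFLOP/s FP16, 700 GB/s) are hypothetical round numbers chosen for illustration, not measurements.

```python
import math


def fft_roofline(n, bytes_per_complex, peak_flops, mem_bandwidth, passes=1):
    """Idealized lower bounds (seconds) for one complex FFT of length n.

    Assumptions (illustrative, not taken from cuFFT):
      * about 5 * n * log2(n) floating-point operations per transform
      * each pass reads and writes the whole signal exactly once
    Returns (compute_time, memory_time).
    """
    flops = 5.0 * n * math.log2(n)
    traffic_bytes = 2.0 * passes * n * bytes_per_complex  # read + write
    return flops / peak_flops, traffic_bytes / mem_bandwidth


# Hypothetical device figures, for illustration only.
PEAK_FLOPS = {"fp32": 10e12, "fp16": 20e12}  # FLOP/s
BYTES_PER_COMPLEX = {"fp32": 8, "fp16": 4}   # a complex value is two reals
BANDWIDTH = 700e9                            # bytes/s

N = 2 ** 20
for name in ("fp32", "fp16"):
    t_compute, t_memory = fft_roofline(
        N, BYTES_PER_COMPLEX[name], PEAK_FLOPS[name], BANDWIDTH
    )
    bound = "memory-bound" if t_memory > t_compute else "compute-bound"
    # "Effective" throughput in numbers/second rather than bytes/second.
    values_per_s = BANDWIDTH / BYTES_PER_COMPLEX[name]
    print(f"{name}: compute {t_compute * 1e6:.1f} us, "
          f"memory {t_memory * 1e6:.1f} us, {bound}, "
          f"{values_per_s / 1e9:.1f} G complex values/s")
```

Under these assumptions both precisions come out memory-bound at N = 2^20, and halving the element size halves the ideal memory time. That is the upper bound on the gain; any loss of memory-access efficiency with the narrower type reduces it.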

Your graphs seem to be showing that this is not the case for large FFTs. This may indicate that the FFT code is not fully optimized for FP16 for large FFTs, either because these use cases have lesser importance in the market (*) or because there are technical difficulties (e.g. accuracy issues). I would suggest filing an RFE (request for enhancement) with NVIDIA, which you can do via the bug reporting web form linked from the CUDA registered developer website. Simply prefix the synopsis with “RFE:” to mark it as an RFE rather than a functional bug.

(*) For any sufficiently large library, it is economically infeasible to fully optimize all possible variants of a particular functionality. Which variants get the most attention from software developers is typically prioritized based on market demands.
