With NVIDIA GPUs that offer full support for half-precision floating point (FP16), I was expecting roughly a 2x performance boost with FP16 compared to single-precision floating point (FP32).
I have run repeatable benchmarks under controlled conditions on an NVIDIA Tesla P100 and a Jetson TX2, and in both cases the 2x performance boost only appears for very small FFT sizes: smaller than 2^10 on the P100 and smaller than 2^13 on the Jetson TX2.
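For context, here is a minimal sketch of the kind of FP16 vs FP32 timing comparison I am describing, using cuFFT's cufftXt API (which accepts CUDA_C_16F for half-precision complex transforms). The FFT sizes, iteration count, and timing scaffolding below are illustrative only and are not the actual benchmark code from the repository; error checking is omitted for brevity.

```cuda
// Illustrative FP16 vs FP32 cuFFT timing sketch (not the repository's benchmark code).
// FP16 transforms in cuFFT require power-of-2 sizes and a GPU with native FP16 support.
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cufft.h>
#include <cufftXt.h>
#include <library_types.h>

// Time an in-place 1D complex-to-complex forward FFT of length n, averaged over iters runs.
static float time_fft(cudaDataType type, long long n, int iters)
{
    cufftHandle plan;
    size_t workSize = 0;
    cufftCreate(&plan);
    // Same plan call for both precisions; only the data/execution type differs.
    cufftXtMakePlanMany(plan, 1, &n, nullptr, 1, 1, type,
                        nullptr, 1, 1, type, 1, &workSize, type);

    size_t elemSize = (type == CUDA_C_16F) ? sizeof(__half2) : sizeof(cufftComplex);
    void *data = nullptr;
    cudaMalloc(&data, n * elemSize);
    cudaMemset(data, 0, n * elemSize);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cufftXtExec(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    cufftDestroy(plan);
    return ms / iters;
}

int main()
{
    const int iters = 100;
    for (long long n = 1LL << 8; n <= 1LL << 20; n <<= 2) {
        float fp16 = time_fft(CUDA_C_16F, n, iters);
        float fp32 = time_fft(CUDA_C_32F, n, iters);
        printf("N = %lld  FP16 %.4f ms  FP32 %.4f ms  speedup %.2fx\n",
               n, fp16, fp32, fp32 / fp16);
    }
    return 0;
}
```

In a sweep like this, the FP32/FP16 ratio stays close to 2x for small sizes but drops off as the transform size grows, which matches what I observe on both boards.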
Full results, together with the benchmark source code, are available in this public Git repository:
A direct link to the P100 vs Jetson TX2 results is:
Has anybody stumbled upon similar results when experimenting with FP16?