Please let me know if I am missing something (data layout, cuFFT version, memory type, etc.), because the website claims that lower precisions have higher throughput.
I note that the performance comparison between FP32 and FP16 may depend on the shape of the FFT. I don’t have a Jetson to test on, but on a few other GPUs I’ve tried, the FP16 path is faster for a shape of (32, 16384). For example, with CUDA 12.0 on a GTX 1660 Ti:
$ ./t32
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.1043 ms
trial #1, elapsed: 0.1049 ms
trial #2, elapsed: 0.1023 ms
trial #3, elapsed: 0.1035 ms
trial #4, elapsed: 0.1031 ms
trial #5, elapsed: 0.1041 ms
trial #6, elapsed: 0.1040 ms
trial #7, elapsed: 0.1037 ms
trial #8, elapsed: 0.1044 ms
trial #9, elapsed: 0.1039 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0715 ms
trial #1, elapsed: 0.0734 ms
trial #2, elapsed: 0.0736 ms
trial #3, elapsed: 0.0725 ms
trial #4, elapsed: 0.0734 ms
trial #5, elapsed: 0.0722 ms
trial #6, elapsed: 0.0738 ms
trial #7, elapsed: 0.0719 ms
trial #8, elapsed: 0.0733 ms
trial #9, elapsed: 0.0719 ms
It’s just an observation. I suggest waiting to see what comes of the bug.
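For anyone who wants to put a number on the output above, here is a minimal sketch that averages the trial times per precision and reports the ratio. It is not part of the benchmark itself: the `LOG` constant and the `mean_elapsed_by_section` helper are names made up for illustration, and the parsing assumes the exact line format printed above.

```python
import re
from statistics import mean

# Timings copied verbatim from the GTX 1660 Ti run above, shape (32, 16384).
LOG = """\
============ float32 fft test ===============
trial #0, elapsed: 0.1043 ms
trial #1, elapsed: 0.1049 ms
trial #2, elapsed: 0.1023 ms
trial #3, elapsed: 0.1035 ms
trial #4, elapsed: 0.1031 ms
trial #5, elapsed: 0.1041 ms
trial #6, elapsed: 0.1040 ms
trial #7, elapsed: 0.1037 ms
trial #8, elapsed: 0.1044 ms
trial #9, elapsed: 0.1039 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0715 ms
trial #1, elapsed: 0.0734 ms
trial #2, elapsed: 0.0736 ms
trial #3, elapsed: 0.0725 ms
trial #4, elapsed: 0.0734 ms
trial #5, elapsed: 0.0722 ms
trial #6, elapsed: 0.0738 ms
trial #7, elapsed: 0.0719 ms
trial #8, elapsed: 0.0733 ms
trial #9, elapsed: 0.0719 ms
"""


def mean_elapsed_by_section(log):
    """Return {precision: mean elapsed ms} for each '=== <precision> fft test ===' section."""
    sections = {}
    current = None
    for line in log.splitlines():
        header = re.match(r"=+\s*(\w+) fft test\s*=+", line)
        if header:
            current = header.group(1)
            sections[current] = []
            continue
        trial = re.match(r"trial #\d+, elapsed: ([\d.]+) ms", line)
        if trial and current is not None:
            sections[current].append(float(trial.group(1)))
        # Any other line (e.g. "workSize: ...") is ignored.
    return {name: mean(values) for name, values in sections.items()}


means = mean_elapsed_by_section(LOG)
speedup = means["float32"] / means["float16"]
print(f"float32 mean: {means['float32']:.5f} ms")
print(f"float16 mean: {means['float16']:.5f} ms")
print(f"FP16 speedup over FP32: {speedup:.2f}x")
```

On these ten trials the FP16 path comes out roughly 1.43x faster than FP32. Ten trials is a small sample, so treat the figure as indicative only.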
@Robert_Crovella Thank you for your response! I do see close to a 2x speedup with this shape on an NVIDIA GeForce RTX 3060 GPU (about 1.75x in the run below); I have yet to test this on a Jetson device. However, my use case is closer to a shape of (16, 1024).
$ ./fft_benchmark
FFT shape: (32, 16384)
============ float32 fft test ===============
trial #0, elapsed: 0.0821 ms
trial #1, elapsed: 0.0829 ms
trial #2, elapsed: 0.0829 ms
trial #3, elapsed: 0.0819 ms
trial #4, elapsed: 0.0809 ms
trial #5, elapsed: 0.0819 ms
trial #6, elapsed: 0.0809 ms
trial #7, elapsed: 0.0819 ms
trial #8, elapsed: 0.0819 ms
trial #9, elapsed: 0.0848 ms
============ float16 fft test ===============
workSize: 2097152
trial #0, elapsed: 0.0461 ms
trial #1, elapsed: 0.0481 ms
trial #2, elapsed: 0.0471 ms
trial #3, elapsed: 0.0471 ms
trial #4, elapsed: 0.0471 ms
trial #5, elapsed: 0.0471 ms
trial #6, elapsed: 0.0460 ms
trial #7, elapsed: 0.0481 ms
trial #8, elapsed: 0.0470 ms
trial #9, elapsed: 0.0461 ms
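To give a sense of how much smaller the (16, 1024) case is than the one benchmarked here, a quick sketch of the data volume per transform follows. It assumes complex-to-complex transforms with interleaved storage (4 bytes per FP16 complex element, 8 bytes per FP32 complex element); the benchmark source is not shown in this thread, so that is an assumption, and `buffer_bytes` is a hypothetical helper.

```python
from math import prod

# Bytes per interleaved complex element.
# Assumption: C2C transforms, half2 for FP16 and float2 for FP32.
BYTES_PER_COMPLEX = {"float32": 8, "float16": 4}


def buffer_bytes(shape, precision):
    """Size in bytes of one input buffer holding prod(shape) complex elements."""
    return prod(shape) * BYTES_PER_COMPLEX[precision]


for shape in [(32, 16384), (16, 1024)]:
    for precision in ("float32", "float16"):
        print(f"{shape} {precision}: {buffer_bytes(shape, precision)} bytes")
```

Under that assumption the (16, 1024) shape holds 32 times fewer elements than (32, 16384): 65536 bytes in FP16 versus 2097152 bytes. The latter happens to equal the `workSize` printed above, which is only an observation about the numbers, not a statement of how cuFFT derives it. Whether the FP16 advantage carries over to such a small problem would have to be measured on the target device.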