This version of the cuFFT library supports the following features:
...
- Half-precision (16-bit floating point), single-precision (32-bit floating point) and double-precision (64-bit floating point).
...
However, neither the documentation, the header files cufft*.h, the types in cufftType_t, nor anything in cuda_fp16.h gave me any hints as to how to actually run such transforms :-(
The call to cufftXtMakePlanMany returns 0xB (invalid device). If I add a call to cufftXtSetGPUs before it with just 1 GPU, then cufftXtSetGPUs itself returns 0x4 (invalid value). If I specify 2 GPUs, cufftXtSetGPUs returns fine, but cufftXtMakePlanMany still returns 0xB (invalid device).
I cannot find any online examples of cufftXtMakePlanMany() either.
Do you know how to correctly use cufftXtMakePlanMany()…?
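For reference, this is roughly the call pattern I am attempting: a batched 1-D complex-to-complex FFT with half-precision input, output and execution types. It is only a sketch; the transform length and batch count are placeholders, and as I understand it FP16 transforms also require a power-of-two size and a GPU of compute capability 5.3 or higher.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cufftXt.h>

int main(void)
{
    long long n[1]  = { 1024 };   // transform length (power of two), placeholder
    long long batch = 1000;       // number of transforms in the batch, placeholder
    size_t    workSize;

    cufftHandle plan;
    cufftCreate(&plan);

    // Input, output and execution types are all CUDA_C_16F (half-precision complex).
    cufftResult r = cufftXtMakePlanMany(plan, 1, n,
                                        NULL, 1, 1, CUDA_C_16F,   // input layout / type
                                        NULL, 1, 1, CUDA_C_16F,   // output layout / type
                                        batch, &workSize, CUDA_C_16F);
    if (r != CUFFT_SUCCESS) { printf("cufftXtMakePlanMany failed: %d\n", r); return 1; }

    // Interleaved (re, im) half-precision data, transformed in place.
    half2 *data;
    cudaMalloc(&data, sizeof(half2) * n[0] * batch);
    cudaMemset(data, 0, sizeof(half2) * n[0] * batch);  // placeholder contents

    r = cufftXtExec(plan, data, data, CUFFT_FORWARD);
    if (r != CUFFT_SUCCESS) { printf("cufftXtExec failed: %d\n", r); return 1; }

    cudaFree(data);
    cufftDestroy(plan);
    return 0;
}
```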
Hmm, maybe it did not work because the board was a GTX TITAN X. I have now run the code on a GTX 1080, and cufftXtMakePlanMany() returns successfully and a later cufftXtExec() succeeds. Throughput is only about a quarter of that of 32-bit floating point, though, which is quite disappointing. Presumably a Pascal TITAN X or Pascal TESLA card would be needed to get any speed benefit from cuFFT 16-bit over 32-bit floating point…?
I have just picked up this example, as I am looking at using half-precision FFTs, but I can’t get it working. When I try to run the worked example, it fails in cufftXtMakePlanMany with the result CUFFT_NOT_SUPPORTED.
For reference, when I switch ftype to float it all works fine.
Thanks for the quick reply, but I have now actually managed to get it working.
I understand that half precision is generally slower on the Pascal architecture, but I have read in various places about how this has changed in Volta. Can you point me to somewhere I could find out more about this?
Ultimately I am hoping to do a pile of signal processing on a Jetson Xavier, and would be interested to know whether / how I can use half precision to speed things up.
I tested the performance of float cuFFT and FP16 cuFFT on a Quadro GP100, but the results show that the time consumption of float cuFFT is a little lower than that of FP16 cuFFT. Since the compute capability of the GP100 is 6.0, this result really confuses me. Can you tell me why this is?
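For what it is worth, this is roughly how I am timing the two cases. It is only a sketch, not my exact benchmark; the size, batch count and launch count are placeholders, and error checking is omitted for brevity.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cufftXt.h>

// Average execution time of one batched cufftXtExec call, in milliseconds.
static float timeExec(cufftHandle plan, void *data)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)                    // average over 100 launches
        cufftXtExec(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 100.0f;
}

int main(void)
{
    long long n[1]  = { 1024 };   // placeholder transform length
    long long batch = 1000;       // placeholder batch count
    size_t ws;

    // FP32 plan and data
    cufftHandle plan32;
    cufftCreate(&plan32);
    cufftXtMakePlanMany(plan32, 1, n, NULL, 1, 1, CUDA_C_32F,
                        NULL, 1, 1, CUDA_C_32F, batch, &ws, CUDA_C_32F);
    float2 *d32;
    cudaMalloc(&d32, sizeof(float2) * n[0] * batch);
    cudaMemset(d32, 0, sizeof(float2) * n[0] * batch);

    // FP16 plan and data
    cufftHandle plan16;
    cufftCreate(&plan16);
    cufftXtMakePlanMany(plan16, 1, n, NULL, 1, 1, CUDA_C_16F,
                        NULL, 1, 1, CUDA_C_16F, batch, &ws, CUDA_C_16F);
    half2 *d16;
    cudaMalloc(&d16, sizeof(half2) * n[0] * batch);
    cudaMemset(d16, 0, sizeof(half2) * n[0] * batch);

    printf("FP32: %.3f ms   FP16: %.3f ms\n", timeExec(plan32, d32), timeExec(plan16, d16));

    cudaFree(d32); cudaFree(d16);
    cufftDestroy(plan32); cufftDestroy(plan16);
    return 0;
}
```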