Half precision cuFFT Transforms

I’m trying to measure the FP16 performance of cuFFT. The CUDA Toolkit documentation for both CUDA 7.5 and CUDA 8.0 states under http://docs.nvidia.com/cuda/cufft/#introduction

This version of the cuFFT library supports the following features:
- Half-precision (16-bit floating point), single-precision (32-bit floating point) and double-precision (64-bit floating point).

Similarly, section 2.3.1 (http://docs.nvidia.com/cuda/cufft/#half-precision-transforms) indicates that half-precision transforms are supported.

However, neither the documentation, the header files cufft*.h, the types in cufftType_t, nor anything in cuda_fp16.h gave me any hint as to how to actually run such transforms :-(

What am I missing?

Or is this a documentation bug?

Take a look at cufftXtMakePlanMany


Thanks! I tried it in CUDA 8.0 but it doesn’t work. I thought I was following the documentation, but I might be doing something wrong after all… The basic code is:

size_t fftlen = 1024, batch = 1024;
size_t worksizesXt[10];
long long dimnXt[1] = { fftlen };

  1, dimnXt,
  NULL, 1, fftlen, // NULL: latter 2 args are ignored
  NULL, 1, fftlen/2+1, // NULL: latter 2 args are ignored

The call to cufftXtMakePlanMany returns 0xB (CUFFT_INVALID_DEVICE). If I add a call to cufftXtSetGPUs before it with just 1 GPU, then cufftXtSetGPUs itself returns 0x4 (CUFFT_INVALID_VALUE). If I specify 2 GPUs, cufftXtSetGPUs returns fine but cufftXtMakePlanMany still returns 0xB (CUFFT_INVALID_DEVICE).

Cannot find any online examples for cufftXtMakePlanMany() either.

Do you know how to correctly use cufftXtMakePlanMany()…?

Hmm, maybe it did not work because the board was a GTX TITAN X. I have now run the code on a GTX 1080, and cufftXtMakePlanMany() returns successfully and a later cufftXtExec() succeeds. Throughput is only about a quarter of that of 32-bit floating point though, which is quite disappointing. Presumably a Pascal TITAN X or a Pascal Tesla card would be needed for any speed benefit of 16-bit over 32-bit floating point in cuFFT…?

The only current non-Jetson GPU that will show a meaningful throughput increase for FP16 is the Tesla P100.

All Pascal family members support FP16 computation, but for most it is very slow compared to FP32.

Indeed, you need a Pascal family GPU (or a Jetson TX1) to support any of the library half precision functions or native FP16 of any kind.

Here is a worked example for anyone curious:

#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <assert.h>

typedef half2 ftype;         // one half2 holds one complex FP16 sample (re, im)
long long sig_size = 1<<23;  // power-of-two transform length

int main(){

  ftype *h_idata = (ftype *)malloc(sig_size*sizeof(ftype));
  ftype *d_idata;
  ftype *d_odata;
  cudaMalloc(&d_idata, sizeof(ftype)*sig_size);
  cudaMalloc(&d_odata, sizeof(ftype)*sig_size);
  cufftHandle plan;
  cufftResult r;
  r = cufftCreate(&plan);
  assert(r == CUFFT_SUCCESS);
  size_t ws = 0;
  // 1D complex-to-complex plan, batch of 1; input, output and execution types are all FP16 complex
  r = cufftXtMakePlanMany(plan, 1,  &sig_size, NULL, 1, 1, CUDA_C_16F, NULL, 1, 1, CUDA_C_16F, 1, &ws, CUDA_C_16F);
  assert(r == CUFFT_SUCCESS);
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD); // warm-up
  assert(r == CUFFT_SUCCESS);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);     // start timing
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD);
  assert(r == CUFFT_SUCCESS);
  cudaEventRecord(stop);      // stop timing
  cudaEventSynchronize(stop); // wait for the FFT to finish before reading the timer
  float et;
  cudaEventElapsedTime(&et, start, stop);
  printf("forward FFT time for %lld samples: %fms\n", sig_size, et);
  // clean up
  cufftDestroy(plan);
  cudaFree(d_idata); cudaFree(d_odata);
  free(h_idata);
  return 0;
}

It needs to be compiled for, and run on, a GPU architecture of compute capability (cc) 5.3 or higher.
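The cc 5.3 threshold is easy to get wrong when comparing major and minor numbers separately, so here is a small host-side sketch of the comparison. The helper name supports_half_cufft is hypothetical; in a CUDA program the two numbers would typically come from the major and minor fields of the cudaDeviceProp structure filled in by cudaGetDeviceProperties().

```cpp
// Hypothetical helper (not part of CUDA or cuFFT): returns true when the
// compute capability major.minor is at least 5.3, the minimum stated in
// this thread for half-precision cuFFT transforms.
bool supports_half_cufft(int major, int minor) {
  // Any major version above 5 qualifies; within 5.x, only 5.3 and later do.
  return major > 5 || (major == 5 && minor >= 3);
}
```

As far as I know, the Maxwell GTX TITAN X is cc 5.2 (so this returns false), the Jetson TX1 is cc 5.3, the Tesla P100 is cc 6.0 and the GTX 1080 is cc 6.1 (all true), which matches what was reported earlier in the thread. At build time, the corresponding nvcc option would be something like -arch=sm_60 for a P100.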

I have just picked up this example, as I am looking at using half precision FFTs, but I can’t get it working. When I try to run the worked example it fails in cufftXtMakePlanMany with the result CUFFT_NOT_SUPPORTED.

For reference, when I switch ftype to float it all works fine.

Any ideas where I’m going wrong?

What GPU are you running on, what CUDA version are you using, and what is your compile command line?

Thanks for the quick reply, but I have now actually managed to get it working.

I understand that half precision is generally slower than single precision on the Pascal architecture, but I have read in various places that this has changed in Volta. Can you point me to somewhere I could find out more about this?

Ultimately I am hoping to do a pile of signal processing on a Jetson Xavier, and would be interested to know whether / how I can use half precision to speed things up.

What about a 2D FFT? I set rank to 2 and got a CUFFT_INVALID_VALUE error. I'm working on a TX2, by the way.
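One thing worth ruling out first (a guess at the cause, not a diagnosis): the cuFFT documentation restricts half-precision transforms to power-of-two sizes, so every entry of the n array passed to cufftXtMakePlanMany would have to be a power of two, and n needs one entry per dimension. A host-side pre-flight check could be sketched as follows; both helper names are made up for illustration.

```cpp
// True when v is a positive power of two (exactly one bit set).
bool is_power_of_two(long long v) {
  return v > 0 && (v & (v - 1)) == 0;
}

// Hypothetical pre-flight check (not part of cuFFT) for the n array that
// would be passed to cufftXtMakePlanMany for a half-precision plan.
// Assumes the documented limits: rank 1 to 3, power-of-two sizes only.
bool dims_ok_for_half(const long long *n, int rank) {
  if (rank < 1 || rank > 3) return false;
  for (int i = 0; i < rank; ++i) {
    if (!is_power_of_two(n[i])) return false;
  }
  return true;
}
```

For example, a 1024 x 1024 transform passes this check while a 1080 x 1920 one does not. If the sizes are fine, the strides and distances passed alongside n would be the next thing to look at.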

I tested the performance of single-precision (float) cuFFT and FP16 cuFFT on a Quadro GP100. The results show that the float transform takes slightly less time than the FP16 one. Since the compute capability of the GP100 is 6.0, this result really confuses me. Can you tell me why this is?