Half precision cuFFT Transforms

JanWagner · November 4, 2016, 7:26am

I’m trying to check the FP16 performance of cuFFT. The CUDA Toolkit Documentation for CUDA 7.5 and for CUDA 8.0 states under http://docs.nvidia.com/cuda/cufft/#introduction:

This version of the cuFFT library supports the following features:
...
- Half-precision (16-bit floating point), single-precision (32-bit floating point) and double-precision (64-bit floating point).
...

Similarly, section 2.3.1 (http://docs.nvidia.com/cuda/cufft/#half-precision-transforms) indicates that half-precision transforms are supported.

However, neither the documentation, nor any of the header files cufft*.h, nor the types in cufftType_t, nor anything in cuda_fp16.h gave me any hint as to how to actually run such transforms :-(

What am I missing?

Or is this a documentation bug?

Robert_Crovella · November 10, 2016, 6:18pm

Take a look at cufftXtMakePlanMany

http://docs.nvidia.com/cuda/cufft/index.html#function-cufftxtmakeplanmany

JanWagner · November 11, 2016, 2:33am

Thanks! I tried this in CUDA 8.0 but it doesn’t work. I thought I was following the documentation, but something might be wrong after all… The basic code is:

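long long fftlen = 1024, batch = 1024; // long long matches the cufftXt size arguments
size_t worksizesXt[10];
long long dimnXt[1] = { fftlen };
cufftHandle fftplan_r2cXt_fp16;

cudaSetDevice(0);
cufftCreate(&fftplan_r2cXt_fp16);
cufftXtMakePlanMany(fftplan_r2cXt_fp16, // the same handle that cufftCreate() initialized
  1, dimnXt,
  NULL, 1, fftlen, // NULL: latter 2 args are ignored
  CUDA_R_16F,      // input type: real half
  NULL, 1, fftlen/2+1, // NULL: latter 2 args are ignored
  CUDA_C_16F,      // output type: complex half
  batch,
  worksizesXt,
  CUDA_C_16F       // execution type
);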

The call to cufftXtMakePlanMany returns 0xB (invalid device). If I add a call to cufftXtSetGPUs before it with just 1 GPU, then cufftXtSetGPUs itself returns 0x4 (invalid value). If I specify 2 GPUs, then cufftXtSetGPUs returns fine but cufftXtMakePlanMany still returns 0xB (invalid device).

Cannot find any online examples for cufftXtMakePlanMany() either.

Do you know how to correctly use cufftXtMakePlanMany()…?

JanWagner · November 11, 2016, 2:59am

Hmm, maybe it did not work because the board was a GTX TITAN X. I have now run the code on a GTX 1080, where cufftXtMakePlanMany() returns successfully and a later cufftXtExec() succeeds. Throughput is only about 1/4 that of 32-bit floating point though, which is quite disappointing. Presumably a Pascal TITAN X or Pascal Tesla card would be needed for any speed benefit of 16-bit over 32-bit floating point in cuFFT…?

Robert_Crovella · November 11, 2016, 3:57am

The only current non-Jetson GPU that will show an interesting throughput increase for FP16 is the Tesla P100.

All Pascal family members support FP16 computation, but for most it is very slow compared to FP32.

Indeed, you need a Pascal family GPU (or a Jetson TX1) to support any of the library half precision functions or native FP16 of any kind.

Robert_Crovella · March 15, 2017, 4:18am

Here is a worked example for anyone curious:

#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <assert.h>

typedef half2 ftype;
long long sig_size = 1<<23;


int main(){

  ftype *h_idata = (ftype *)malloc(sig_size*sizeof(ftype));
  ftype *d_idata;
  ftype *d_odata;
  cudaMalloc(&d_idata, sizeof(ftype)*sig_size);
  cudaMalloc(&d_odata, sizeof(ftype)*sig_size);
  cufftHandle plan;
  cufftResult r;
  r = cufftCreate(&plan);
  assert(r == CUFFT_SUCCESS);
  size_t ws = 0;
  r = cufftXtMakePlanMany(plan, 1,  &sig_size, NULL, 1, 1, CUDA_C_16F, NULL, 1, 1, CUDA_C_16F, 1, &ws, CUDA_C_16F);
  assert(r == CUFFT_SUCCESS);
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD); // warm-up
  assert(r == CUFFT_SUCCESS);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD);
  assert(r == CUFFT_SUCCESS);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  printf("forward FFT time for %ld samples: %fms\n", sig_size, et);
  return 0;
}

It needs to be compiled for a GPU architecture of cc 5.3 or higher.
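
The cc 5.3 requirement can be checked before creating a plan. Below is a minimal host-only sketch. meets_fp16_cufft_minimum is a hypothetical helper, and it assumes the two numbers come from the major and minor fields of cudaDeviceProp, as filled in by cudaGetDeviceProperties():

```cpp
// Hypothetical helper: true if a device of compute capability major.minor
// meets the cc 5.3 minimum stated above for half-precision cuFFT.
// In a real program, major and minor would be read from cudaDeviceProp
// after calling cudaGetDeviceProperties().
inline bool meets_fp16_cufft_minimum(int major, int minor) {
    return major > 5 || (major == 5 && minor >= 3);
}
```

For the boards mentioned in this thread, the Maxwell GTX TITAN X (cc 5.2) fails this check, while the Jetson TX1 (cc 5.3), Tesla P100 (cc 6.0) and GTX 1080 (cc 6.1) pass it. Passing only means that plan creation is expected to work. It says nothing about whether FP16 is faster than FP32 on that device.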

charlie.pearson · April 11, 2019, 11:06am

I have just picked up this example, as I am looking at using half precision FFTs, but I can’t get it working. When I try to run the worked example it fails in cufftXtMakePlanMany with the result CUFFT_NOT_SUPPORTED.

For reference, when I switch ftype to float it all works fine.

Any ideas where I’m going wrong?

Robert_Crovella · April 11, 2019, 11:58am

What GPU are you running on, what CUDA version are you using, and what is your compile command line?

charlie.pearson · April 11, 2019, 12:51pm

Thanks for the quick reply, but I have now actually managed to get it working.

I understand that the half precision is generally slower on Pascal architecture, but have read in various places about how this has changed in Volta. Can you point me to somewhere I could find out more about this?

Ultimately I am hoping to do a pile of signal processing on a Jetson Xavier, and would be interested to know whether / how I can use half precision to speed things up.

cbl9.5 · July 28, 2019, 2:52am

Robert_Crovella:

Here is a worked example for anyone curious:

It needs to be compiled for a GPU architecture of cc 5.3 or higher.

What about a 2D FFT? I set rank to 2 and got a CUFFT_INVALID_VALUE error. I am working on a TX2, by the way.

wlelectronics · April 1, 2020, 2:53pm

I tested the performance of single-precision (float) cuFFT and FP16 cuFFT on a Quadro GP100, but the results show that the float transform takes slightly less time than the FP16 one. Since the compute capability of the GP100 is 6.0, this result really confuses me. Can you tell me why this happens?

hazelnutvt04 · March 26, 2021, 8:36pm

Thanks for the example code!

Minor pedantic note. My version of gcc, 9.3, flagged a warning in the very last printf:

printf("forward FFT time for %ld samples: %fms\n", sig_size, et);

where ‘long int’ was expected for %ld, but sig_size is ‘long long’

Changing the format string from %ld to %lld got rid of the warning.

mnicely · March 29, 2021, 1:36pm

Hi cbl9.5,

There’s more to 2D than just changing rank. Please refer to data-layout or twod-advanced-data-layout-use, and SO-example.
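
As a rough illustration of the data-layout point, here is a host-only sketch. It assumes the usual cuFFT convention that the dimension array n has one entry per rank, with the last entry varying fastest in memory (row-major). It also assumes the documented half-precision restriction to power-of-two sizes. is_pow2, dims_ok_for_half and index_2d are hypothetical helpers, not cuFFT functions:

```cpp
// Hypothetical helper: true if x is a positive power of two.
inline bool is_pow2(long long x) { return x > 0 && (x & (x - 1)) == 0; }

// Hypothetical helper: sanity-checks a dimension array before it is handed
// to cufftXtMakePlanMany. n must hold `rank` entries (two for a 2D
// transform), and each entry must be a power of two for half precision.
inline bool dims_ok_for_half(const long long *n, int rank) {
    if (n == nullptr || rank < 1 || rank > 3) return false;
    for (int i = 0; i < rank; ++i)
        if (!is_pow2(n[i])) return false;
    return true;
}

// Row-major offset of element (row, col) in a contiguous 2D signal of
// n[0] rows by n[1] columns: n[1] is the fastest-varying dimension.
inline long long index_2d(const long long *n, long long row, long long col) {
    return row * n[1] + col;
}
```

For a 2D plan, the dimension argument would be an array such as long long n[2] = {1024, 2048}, passed together with rank = 2. Passing the address of a single long long, as the 1D example does with &sig_size, leaves n[1] undefined. That is one plausible cause of CUFFT_INVALID_VALUE, though the thread does not confirm it.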
