cuFFT 2D on FP16 2D array

ron.persky · November 22, 2020, 9:05am

Hi all,

I’m trying to run a 2D cuFFT on a 2D array of type __half2.
I am using cufftXtMakePlanMany and cufftXtExec, but the output contains “inf” and “nan” values, so something is wrong.
The 2D array holds radar data of size Nsamples x Nchirps.

Below is my configuration for the cuFFT plan and execution.

    cufftHandle plan;
    cufftCreate(&plan);
    const int rank = 2;  // must be a compile-time constant to size size_arr below
    int batch = 1;
    size_t ws = 0;
    long long size_arr[rank] = {Nsamples, Nchirps};
    long long int idist = Nsamples;
    long long int odist = Nsamples;
    int istride = 1;
    int ostride = 1;
    if(cufftXtMakePlanMany(plan, rank, size_arr, NULL, istride, idist, CUDA_C_16F, NULL, ostride, odist, CUDA_C_16F, batch, &ws, CUDA_C_16F) != CUFFT_SUCCESS)
    {
         printf("cufftXtMakePlanMany Error\n");
    }

    if(cufftXtExec(plan, devInData, devOutData, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        printf("cufftXtExec 1 Error\n");
    }
    cudaDeviceSynchronize();

What am I missing?

Thanks,
Ron

Robert_Crovella · November 22, 2020, 7:11pm

This part of the docs may be of interest:

Half precision transforms might not be suitable for all kinds of problems due to limited range represented by half precision floating point arithmetics. Please note that the first element of FFT result is the sum of all input elements and it is likely to overflow for certain inputs.

If a similar transform setup is working for you in the 32-bit case, then overflow of the half-precision range may be the issue.

As you can also probably now imagine, your report may have a data dependency. Therefore, debugging your case may require the actual data, not just the code you have shown.
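
To make the overflow point concrete, here is a minimal host-side sketch (not part of cuFFT; the helper names `dft_magnitude_bound` and `safe_input_scale` are made up for illustration). It relies on the fact that no DFT output bin can exceed the sum of the input magnitudes, and compares that bound with 65504, the largest finite FP16 value. It assumes you still have the data available in FP32 on the host before converting it to half.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Largest finite value representable in IEEE 754 binary16 (FP16).
constexpr double kHalfMax = 65504.0;

// Upper bound on the magnitude of any DFT output bin: |X[k]| <= sum_n |x[n]|.
// Accumulated in double so the check itself does not lose precision.
double dft_magnitude_bound(const std::vector<std::complex<float>>& x) {
    double bound = 0.0;
    for (const auto& v : x) {
        bound += std::abs(std::complex<double>(v));
    }
    return bound;
}

// Hypothetical helper: factor to multiply the input by before converting it
// to half, so that no output bin can exceed kHalfMax. Returns 1.0 when the
// data already fits.
double safe_input_scale(const std::vector<std::complex<float>>& x) {
    const double bound = dft_magnitude_bound(x);
    return bound > kHalfMax ? kHalfMax / bound : 1.0;
}
```

For example, a 64 x 64 array of (1.0, 0.0) gives a bound of 4096, which fits, while a 512 x 512 array of the same value gives 262144, which does not. The bound is conservative, and scaling the input down trades overflow for lost precision on small values, so this is only a diagnostic starting point.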

Robert_Crovella · November 22, 2020, 7:33pm

FWIW, building a simple, complete test case around what you have shown seems to produce a valid result:

$ cat t11.cu
#include <cstdio>
#include <cufft.h>
#include <cuda_fp16.h>
#include <cufftXt.h>
int main(){

    half2 *devInData, *devOutData, *hInData, *hOutData;
    const int Nsamples = 64;
    const int Nchirps = 64;
    cudaMalloc(&devInData,    Nsamples*Nchirps*sizeof(devInData[0]));
    cudaMalloc(&devOutData,   Nsamples*Nchirps*sizeof(devOutData[0]));
    hInData = new half2[Nsamples*Nchirps];
    hOutData = new half2[Nsamples*Nchirps];
    for (int i = 0; i < Nsamples*Nchirps; i++) hInData[i] = half2(1.0, 0.0);
    cudaMemcpy(devInData, hInData, Nsamples*Nchirps*sizeof(devInData[0]), cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftCreate(&plan);
    const int rank = 2;
    int batch = 1;
    size_t ws = 0;
    long long size_arr[rank] = {Nsamples, Nchirps};
    long long int idist = Nsamples;
    long long int odist = Nsamples;
    int istride = 1;
    int ostride = 1;
    if(cufftXtMakePlanMany(plan, rank, size_arr, NULL, istride, idist, CUDA_C_16F, NULL, ostride, odist, CUDA_C_16F, batch, &ws, CUDA_C_16F) != CUFFT_SUCCESS)
    {
         printf("cufftXtMakePlanMany Error\n");
    }

    if(cufftXtExec(plan, devInData, devOutData, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        printf("cufftXtExec 1 Error\n");
    }
    cudaDeviceSynchronize();
    cudaMemcpy(hOutData, devOutData, Nsamples*Nchirps*sizeof(devOutData[0]), cudaMemcpyDeviceToHost);
    printf("%f\n", (float)(hOutData[0].x));
}
$ nvcc -o t11 t11.cu -lcufft -std=c++11
$ ./t11
4096.000000
$
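
The value 4096 is the expected DC bin: the sum of 64 x 64 inputs of 1.0. For inputs that are not constant, a small CPU reference can be used to cross-check the GPU output. The following is a naive sketch, assuming row-major storage with the second dimension varying fastest; `dft2d_reference` is a hypothetical helper, not a cuFFT function.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive forward 2D DFT of a row-major nx-by-ny array (ny varies fastest).
// Cost is O((nx*ny)^2), so this is only meant for validating small test cases.
std::vector<std::complex<double>> dft2d_reference(
        const std::vector<std::complex<double>>& in, int nx, int ny) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> out(in.size());
    for (int kx = 0; kx < nx; ++kx) {
        for (int ky = 0; ky < ny; ++ky) {
            std::complex<double> acc(0.0, 0.0);
            for (int x = 0; x < nx; ++x) {
                for (int y = 0; y < ny; ++y) {
                    // A forward transform uses a negative exponent.
                    const double phase =
                        -two_pi * (double(kx) * x / nx + double(ky) * y / ny);
                    acc += in[x * ny + y] * std::polar(1.0, phase);
                }
            }
            out[kx * ny + ky] = acc;
        }
    }
    return out;
}
```

Comparing this reference against both the FP32 and FP16 GPU results on a small array should show whether a discrepancy comes from the plan setup or from the limited half-precision range.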

ron.persky · November 23, 2020, 10:39am

Hi Robert,

Thank you for the quick and detailed response.
I have moved to the cufftPlan2d API and am now using FP32.
It is working now, so it might have been the precision issue.

In any case, cufftPlan2d with FP32 is faster than cufftXtMakePlanMany with FP16, so I’ll be using that.

Best regards,
Ron

Robert_Crovella · March 13, 2023, 2:22pm

Regarding FP16 vs. FP32 performance, this may be of interest.
