cuFFT 2D on FP16 2D array

Hi all,

I’m trying to perform a 2D cuFFT on a 2D array of type __half2.
I am using cufftXtMakePlanMany and cufftXtExec, but the output contains “inf” and “nan” values, so something is wrong.
The 2D array holds radar data of size Nsamples x Nchirps.

Below is my configuration for the cuFFT plan and execution.

    cufftHandle plan;
    cufftCreate(&plan);
    const int rank = 2;  // must be a compile-time constant to size and initialize size_arr below
    int batch = 1;
    size_t ws = 0;
    long long size_arr[rank] = {Nsamples, Nchirps};
    long long int idist = Nsamples;
    long long int odist = Nsamples;
    int istride = 1;
    int ostride = 1;
    if(cufftXtMakePlanMany(plan, rank, size_arr, NULL, istride, idist, CUDA_C_16F, NULL, ostride, odist, CUDA_C_16F, batch, &ws, CUDA_C_16F) != CUFFT_SUCCESS)
    {
         printf("cufftXtMakePlanMany Error\n");
    }

    if(cufftXtExec(plan, devInData, devOutData, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        printf("cufftXtExec 1 Error\n");
    }
    cudaDeviceSynchronize();

What am I missing?

Thanks,
Ron

This part of the docs may be of interest:

Half precision transforms might not be suitable for all kinds of problems due to limited range represented by half precision floating point arithmetics. Please note that the first element of FFT result is the sum of all input elements and it is likely to overflow for certain inputs.

If a similar transform setup is working for you in the 32-bit case, then this may be an issue.

As you can also probably now imagine, your report may have a data dependency. Therefore, debugging your case may require the actual data, not just the code you have shown.
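To make the range limit in the quoted passage concrete: the largest finite FP16 value is 65504, and the first output element of an unnormalized forward FFT is the sum of all inputs. Once any value overflows to inf, a later step can compute inf - inf, which is NaN, so both symptoms can come from a single overflow. Below is a minimal host-side sketch, not part of cuFFT, that checks a higher-precision copy of the input before committing to an FP16 plan. The names `dcBinPeak` and `dcBinOverflowsHalf` are made up for illustration, and the check covers only the first output element; other bins can overflow as well.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Largest finite value representable in IEEE 754 binary16 (FP16).
constexpr double kHalfMax = 65504.0;

// Larger of |real| and |imag| of the first output element of an
// unnormalized forward FFT, which is the plain sum of all inputs.
// Hypothetical helper, accumulates in double to avoid its own overflow.
double dcBinPeak(const std::vector<std::complex<float>>& in) {
    assert(!in.empty());
    std::complex<double> sum(0.0, 0.0);
    for (const auto& v : in) {
        sum += std::complex<double>(v.real(), v.imag());
    }
    return std::max(std::fabs(sum.real()), std::fabs(sum.imag()));
}

// True if the first output element alone would leave the finite FP16 range.
bool dcBinOverflowsHalf(const std::vector<std::complex<float>>& in) {
    return dcBinPeak(in) > kHalfMax;
}
```

For example, a 64 x 64 array filled with (1, 0) sums to 4096, which is well inside the FP16 range, while the same fill at 512 x 512 sums to 262144 and would overflow.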


FWIW, building a simple, complete test case around what you have shown seems to produce a valid result:

$ cat t11.cu
#include <cstdio>
#include <cufft.h>
#include <cuda_fp16.h>
#include <cufftXt.h>
int main(){

    half2 *devInData, *devOutData, *hInData, *hOutData;
    const int Nsamples = 64;
    const int Nchirps = 64;
    cudaMalloc(&devInData,    Nsamples*Nchirps*sizeof(devInData[0]));
    cudaMalloc(&devOutData,   Nsamples*Nchirps*sizeof(devOutData[0]));
    hInData = new half2[Nsamples*Nchirps];
    hOutData = new half2[Nsamples*Nchirps];
    for (int i = 0; i < Nsamples*Nchirps; i++) hInData[i] = half2(1.0, 0.0);
    cudaMemcpy(devInData, hInData, Nsamples*Nchirps*sizeof(devInData[0]), cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftCreate(&plan);
    const int rank = 2;
    int batch = 1;
    size_t ws = 0;
    long long size_arr[rank] = {Nsamples, Nchirps};
    long long int idist = Nsamples;
    long long int odist = Nsamples;
    int istride = 1;
    int ostride = 1;
    if(cufftXtMakePlanMany(plan, rank, size_arr, NULL, istride, idist, CUDA_C_16F, NULL, ostride, odist, CUDA_C_16F, batch, &ws, CUDA_C_16F) != CUFFT_SUCCESS)
    {
         printf("cufftXtMakePlanMany Error\n");
    }

    if(cufftXtExec(plan, devInData, devOutData, CUFFT_FORWARD) != CUFFT_SUCCESS)
    {
        printf("cufftXtExec 1 Error\n");
    }
    cudaDeviceSynchronize();
    cudaMemcpy(hOutData, devOutData, Nsamples*Nchirps*sizeof(devOutData[0]), cudaMemcpyDeviceToHost);
    printf("%f\n", (float)(hOutData[0].x));
}
$ nvcc -o t11 t11.cu -lcufft -std=c++11
$ ./t11
4096.000000
$
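The test case above stays in range because 64 x 64 inputs of magnitude 1 sum to 4096. If FP16 were still required for larger or louder data, one possible workaround (an assumption on my part, not something the cuFFT docs prescribe) is to scale the input down so that the worst-case sum fits, and to undo the scale after converting the result back to FP32. A minimal sketch follows, using the bound that every output bin has magnitude at most count times the largest input magnitude; `safeInputScale` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Largest finite value representable in IEEE 754 binary16 (FP16).
constexpr double kHalfMax = 65504.0;

// Factor to multiply the input by so that no output bin of an unnormalized
// FFT can exceed the FP16 range. Uses |X[k]| <= count * maxMagnitude.
// Returns 1.0 when the data already fits. Hypothetical helper.
double safeInputScale(std::size_t count, double maxMagnitude) {
    assert(count > 0 && maxMagnitude > 0.0);
    const double worstCase = static_cast<double>(count) * maxMagnitude;
    return std::min(1.0, kHalfMax / worstCase);
}
```

The trade-off is precision: FP16 has only about three decimal digits, so scaling down pushes small samples toward the subnormal range, and weak radar returns may be lost. This bound is also conservative, since real data rarely adds up fully in phase.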

Hi Robert,

Thank you for the quick and detailed response.
I have moved to the cufftPlan2d API and am now using FP32.
It is working now, so it might have been the precision issue.

In any case, cufftPlan2d with FP32 is faster than cufftXtMakePlanMany with FP16, so I’ll be using that.

Best regards,
Ron

Regarding FP16 vs. FP32 performance, this may be of interest.