[SOLVED] cuFFT data storage, maybe I'm operating on the wrong elements

I am implementing some signal handling functions and many of them are FFT-related.
One I am having trouble with is the Hilbert Transform, which I implemented after Matlab/Octave hilbert (sort of).
The problem is that, since I don’t know how cuFFT stores the positive/negative frequencies, it is possible that my function is zeroing the wrong elements.

I had a look at the documentation and specially this thread, https://devtalk.nvidia.com/default/topic/793692/gpu-accelerated-libraries/how-cufft-complex-data-is-stored-/, and Njuffa’s explanation here, https://stackoverflow.com/questions/13535182/copying-data-to-cufftcomplex-data-struct

However, his answer addresses the raw copy of data. My case is: where in the array are the negative frequencies?

Below is a complete, fully working program that demonstrates my line of thought. An input array is filled with sine, then the fwd cuFFT is computed, then the “Hilbert” func is run on the complex, then inv FFT is computed and the output is scaled. Both input and output arrays are printed to disk so they can be plotted:

#include <iostream>		// Standard I/O
#include <fstream>		// For file writing
#include <cufft.h>

#ifndef __CUDA_CALL
    #define __CUDA_CALL(call) do {\
        cudaError_t cuda_error = call;\
        if(cuda_error != cudaSuccess) {\
            std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl;\
            return -1;}\
        } while(0)
#endif

using namespace std;

// Helper function to write input and output data
__host__ void Write_File(float *output, const int LENGTH, const char *file_name)
	{
	ofstream out_file(file_name);
	for(int i = 0; i < LENGTH; i++)
		out_file << output[i] << endl;

	out_file.close();
	}

__device__ const float PI = 3.141592654f;

// Fills the array in GPU
__global__ void Populate(float *input, const int LENGTH)
    {
    int	tid	= blockDim.x * blockIdx.x + threadIdx.x,
    offset	= gridDim.x * blockDim.x;

    for(int global_idx = tid; global_idx < LENGTH; global_idx += offset)
        input[global_idx] = __sinf(global_idx * 2 * PI * 0.01);
    }

// Hilbert Transform programmed to resemble that of Matlab/Octave
__global__ void Hilbert(cufftComplex *dev_complex, const int COMPLEX_LENGTH, const int LIMIT_1, const int LIMIT_2)
    {
    int	tid	= blockDim.x * blockIdx.x + threadIdx.x,
        offset	= gridDim.x * blockDim.x;

    if((tid > 0) && (tid < LIMIT_1))
        for(int global_idx = tid; global_idx < LIMIT_1; global_idx += offset)
            {
            dev_complex[global_idx].x *= 2;
            dev_complex[global_idx].y *= 2;
            }
    else
        for(int global_idx = LIMIT_2; global_idx < COMPLEX_LENGTH; global_idx += offset)
            {
            dev_complex[global_idx].x = 0;
            dev_complex[global_idx].y = 0;
            }
    }

int main(void)
    {
    const int	LENGTH	= 1024,
                G_SIZE	= 200,
                B_SIZE	= 256,
                CPX_LEN	= LENGTH / 2 + 1,	// Defines the complex array length according to cuFFT documentation
                LIMIT_1	= CPX_LEN % 2 == 0 ? CPX_LEN / 2 : (CPX_LEN + 1) / 2,	// Determines the loop limits according to
                LIMIT_2	= CPX_LEN % 2 == 0 ? LIMIT_1 + 1 : LIMIT_1;		// the complex length being even or odd

    float          *input,
                   *output;
    cufftComplex   *complex;
    cufftHandle    fwd_plan,
                   inv_plan;

    cufftPlan1d(&fwd_plan, LENGTH, CUFFT_R2C, 1);		// Creates forward and inverse plans
    cufftPlan1d(&inv_plan, LENGTH, CUFFT_C2R, 1);

    // Allocates and initializes the managed arrays
    __CUDA_CALL(cudaMallocManaged(&input,  LENGTH * sizeof(float)));
    __CUDA_CALL(cudaMallocManaged(&output, LENGTH * sizeof(float)));
    __CUDA_CALL(cudaMallocManaged(&complex, (CPX_LEN) * sizeof(cufftComplex)));
    __CUDA_CALL(cudaMemset(input,  0, LENGTH * sizeof(float)));
    __CUDA_CALL(cudaMemset(output, 0, LENGTH * sizeof(float)));
    __CUDA_CALL(cudaMemset(complex, 0, (CPX_LEN) * sizeof(cufftComplex)));

    Populate <<< G_SIZE, B_SIZE >>> (input, LENGTH);
    __CUDA_CALL(cudaDeviceSynchronize());

    cufftExecR2C(fwd_plan, (cufftReal *)input, complex);	// Forward FFT
    __CUDA_CALL(cudaDeviceSynchronize());

    Hilbert <<< G_SIZE, B_SIZE >>> (complex, CPX_LEN, LIMIT_1, LIMIT_2);
    __CUDA_CALL(cudaDeviceSynchronize());

    cufftExecC2R(inv_plan, complex, (cufftReal *)output);	// Inverse FFT
    __CUDA_CALL(cudaDeviceSynchronize());

    for(int i = 0; i < LENGTH; i++)			// Scales the iFFT
        output[i] = output[i] / LENGTH;

    Write_File(input,  LENGTH, "Input.txt");
    Write_File(output, LENGTH, "Output.txt");

    __CUDA_CALL(cudaFree(input));
    __CUDA_CALL(cudaFree(output));
    __CUDA_CALL(cudaFree(complex));

    return 0;
    }

You can copy/paste this code and compile with -lcufft. When you run, it will output 2 text files, which you can then inspect in Matlab/Octave:

load Input.txt;
load Output.txt;
plot(Input)
hold on;
plot(Output, "color", "red")

You will notice that the output (red line) isn’t really applying any phase shift, it just has a different amplitude, which makes me believe that cuFFT stores the elements in another way, so the function operates on the wrong elements.
Any light on it?

  1. There’s really no difference between CUFFT and FFTW for this type of work. Therefore, if you can get it working purely in host code using FFTW, the implementation in CUFFT should be a mechanical translation. Stated another way, the problem you’re having is not a CUDA or CUFFT issue. For example, see here https://www.youtube.com/watch?v=YZ_NMdIbOYs

  2. If I were working on this, for understanding I absolutely would not start with R2C and C2R transforms. CUFFT hermitian symmetry (identical to FFTW symmetry: http://www.fftw.org/fftw3_doc/One_002dDimensional-DFTs-of-Real-Data.html) messes people up regularly. I would start with C2C transforms because that is how we all first learn about Fourier Transforms. It is much more intuitive, IMO, and algorithms like the writeup for how to do this on the matlab page:

https://www.mathworks.com/help/signal/ref/hilbert.html#d120e84897

presume that kind of data organization (this is evident from the indexing for the output element-wise windowing). Once you get it working with C2C transforms, you can then introduce R2C/C2R and symmetry handling, if you wish.

After a bit more thought, I think it should be evident that this calculation cannot be done in the hermitian-symmetry output space of a R2C transform. Hermitian symmetry has a specific meaning, and it does not mean that the output values that aren’t formally represented are zero. It means that there is a reflective type complex-conjugate symmetry in the output. If you then consider what the windowing function of the hilbert transform is doing to the (proper) output space of the forward transform, it is breaking that symmetry. Once you break that symmetry, you cannot use C2R to get a proper result, for the inverse transform.

Do this with C2C. Use the exact methodology, and exact indexing, and exact windowing calculation, given on the matlab page I linked above.

Later: even simpler. The hilbert transform defined by matlab can (does) produce a complex result:

https://www.mathworks.com/help/signal/ref/hilbert.html#d120e84565

clearly a C2R transform cannot do that.

If you want to use R2C I think it is still possible. Then in that case, following the matlab method I would:

  1. Use a R2C tranform on the input. This gives a result of length n/2+1
  2. Zero-extend (pad) the result out to length n.
  3. Leave elements of index 0 and n/2 alone.
  4. Double all other elements
  5. Perform inverse transform with C2C, of the full length n items.

Good evening, Robert, and thanks for your reply.
Your comment on the complex multiplication breaking the symmetry totally makes sense, and this is what I tried to understand when comparing Matlab/Octave’s method to cuFFT (and related). I have 4 threads opened showing this problem.

Switching to C2C, however, it already takes cufftComplex as arguments. In this case, should I copy my float input to the .x member of the cufftComplex and make .y all zeroes? Then I can refer to the interleaved copy demonstrated by Njuffa? The .y would be subject to a cudaMemset2D.

Yes, for a C2C forward transform of real data, set the imaginary component of each input element to zero

Here’s a worked example, using the input and output example values given on the matlab page:

https://www.mathworks.com/help/signal/ref/hilbert.html#d120e84565

(ie. [1,2,3,4] input)

$ cat t379.cu
#include <iostream>             // Standard I/O
#include <fstream>              // For file writing
#include <cufft.h>

#ifndef __CUDA_CALL
    #define __CUDA_CALL(call) do {\
        cudaError_t cuda_error = call;\
        if(cuda_error != cudaSuccess) {\
            std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl;\
            return -1;}\
        } while(0)
#endif

using namespace std;

// Helper function to write input and output data
__host__ void Write_File(float *output, const int LENGTH, const char *file_name)
        {
        ofstream out_file(file_name);
        for(int i = 0; i < LENGTH; i++)
                out_file << output[i] << endl;

        out_file.close();
        }

__device__ const float PI = 3.141592654f;

// Fills the array in GPU
__global__ void Populate(float *input, const int LENGTH)
    {
    int tid     = blockDim.x * blockIdx.x + threadIdx.x,
    offset      = gridDim.x * blockDim.x;

    for(int global_idx = tid; global_idx < LENGTH; global_idx += offset)
        input[global_idx] = __sinf(global_idx * 2 * PI * 0.01);
    }

// Hilbert Transform programmed to resemble that of Matlab/Octave
__global__ void Hilbert(cufftComplex *dev_complex, const int LIMIT_1, const int LIMIT)
    {
    int tid     = blockDim.x * blockIdx.x + threadIdx.x+1,
        offset  = gridDim.x * blockDim.x;
    while(tid < LIMIT){
            if (tid < LIMIT_1-1)
            {
              dev_complex[tid].x *= 2;
              dev_complex[tid].y *= 2;
            }
            if (tid > LIMIT_1-1)
            {
              dev_complex[tid].x = 0;
              dev_complex[tid].y = 0;
            }
            tid += offset;}
    }

int main(void)
    {
    const int   LENGTH  = 4,
                B_SIZE  = 256,
                G_SIZE  = (LENGTH+B_SIZE-1)/B_SIZE,
                CPX_LEN = LENGTH / 2 + 1;       // Defines the complex array length according to cuFFT documentation

    cufftComplex   *output;
    cufftHandle    fwd_plan,
                   inv_plan;
#ifdef USE_R2C
    cufftReal    *input;
    cufftPlan1d(&fwd_plan, LENGTH, CUFFT_R2C, 1);               // Creates forward and inverse plans
#else
    cufftComplex    *input;
    cufftPlan1d(&fwd_plan, LENGTH, CUFFT_C2C, 1);               // Creates forward and inverse plans
#endif
    cufftPlan1d(&inv_plan, LENGTH, CUFFT_C2C, 1);

    // Allocates and initializes the managed arrays
    __CUDA_CALL(cudaMallocManaged(&input,  LENGTH * sizeof(input[0])));
    __CUDA_CALL(cudaMallocManaged(&output, LENGTH * sizeof(output[0])));
    memset(input,  0, LENGTH * sizeof(input[0]));
    memset(output, 0, LENGTH * sizeof(output[0]));
#ifdef USE_R2C
    input[0] = 1; input[1] = 2; input[2] = 3; input[3] = 4;
    cufftExecR2C(fwd_plan, input, output);      // Forward FFT
#else
    input[0].x = 1; input[1].x = 2; input[2].x = 3; input[3].x = 4;
    cufftExecC2C(fwd_plan, input, output, CUFFT_FORWARD);       // Forward FFT
#endif

    Hilbert <<< G_SIZE, B_SIZE >>> (output, CPX_LEN, LENGTH);
    __CUDA_CALL(cudaDeviceSynchronize());

    cufftExecC2C(inv_plan, output, output, CUFFT_INVERSE);      // Inverse FFT
    __CUDA_CALL(cudaDeviceSynchronize());

    for(int i = 0; i < LENGTH; i++){                    // Scales the iFFT
        output[i].x = output[i].x / LENGTH;
        output[i].y = output[i].y / LENGTH;}
    for (int i = 0; i < LENGTH; i++)
      std::cout << output[i].x << "," << output[i].y << std::endl;

    __CUDA_CALL(cudaFree(input));
    __CUDA_CALL(cudaFree(output));

    return 0;
    }
$ nvcc -o t379 t379.cu -lcufft -DUSE_R2C
$ ./t379
1,1
2,-1
3,-1
4,1
$ nvcc -o t379 t379.cu -lcufft
$ ./t379
1,1
2,-1
3,-1
4,1
$

Right now I’m studying the changes you made.
I see that you added +1 to line #42, and changed LIMIT accordingly.
Is it to skip the first element (DC component)?

Yes, according to the matlab method already references, the elements at matlab index 1 and n/2 +1 are not scaled. In C-style indexing this is 0 and n/2. To skip element 0, add one to the thread index for each thread. to skip element at n/2, when we pass the complex limit as n/2+1, we must only scale elements that are at indices less than complex limit -1. Likewise for threads where the index is greater than complex limit -1, we set these to zero, to account for the non USE_R2C case. In the USE_RTC case, we could simply stop the kernel at complex limit -1 for a small optimization.

Thanks for the clarification.
I am adjusting my program to use the C2C and the changes on the HT kernel, so I can test it on very large data. But with the 1-2-3-4 sample we can already see the rotation on the imag part.
As soon as it is finished, I report back.

I applied your changes to my test program: fixed the Hilbert kernel and replaced R2C/C2R with C2C.
The input array is filled in CPU now for code simplicity, and I am using the interleaved copy to fill the real part of the complex with my input data, initialize the imag with 0, and copy the imag part to host so I can apply an arbitrary phase shift with it using Euler’s formula.

However, copying the imag part to my host array seems to be wrong, as the Output.txt values show. To check that, I added a kernel just to print a few elements of the complex array in device memory right after the iFFT, and they look correct. When I copy this to host, they don’t match.

The complete working (???) code ready for copy/paste/compile is:

#include <iostream>	// Standard I/O
#include <fstream>	// For file writing
#include <cmath>        // For the sin() function
#include <cufft.h>

#ifndef __CUDA_CALL
    #define __CUDA_CALL(call) do {\
        cudaError_t cuda_error = call;\
        if(cuda_error != cudaSuccess) {\
            std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl;\
        return -1;}\
        } while(0)
#endif

using namespace std;

// Helper function to write input and output data
__host__ void Write_File(float *output, const int LENGTH, const char *file_name)
    {
    ofstream out_file(file_name);
    for(int i = 0; i < LENGTH; i++)
        out_file << output[i] << endl;

    out_file.close();
    }

__global__ void Hilbert(cufftComplex *dev_complex, const int CPX_LEN, const int LENGTH)
    {
    int tid     = blockDim.x * blockIdx.x + threadIdx.x + 1,	// Skips the first element (DC component)
        offset  = gridDim.x * blockDim.x;

    while(tid < LENGTH)
        {
        if(tid < CPX_LEN - 1)
            {
            dev_complex[tid].x *= 2;
            dev_complex[tid].y *= 2;
            }
        if(tid > CPX_LEN - 1)
            {
            dev_complex[tid].x = 0.0f;
            dev_complex[tid].y = 0.0f;
            }
    	tid += offset;
        }
    }

__global__ void Print_Complex(cufftComplex *dev_complex, const int LENGTH)
    {
    for(int i = 0; i < 50; i++)
        printf("%.2f , %.0f\n", dev_complex[i].x / LENGTH, dev_complex[i].y / LENGTH);
    }

int main(void)
    {
    const int	LENGTH	= 1024,
    		CPX_LEN	= LENGTH / 2 + 1,
		B_SIZE	= 256,
                G_SIZE	= (LENGTH + B_SIZE - 1) / B_SIZE,
    const float PI	= 3.141592654f;

    float          input[LENGTH],
                   output[LENGTH];
    cufftComplex   *complex;
    cufftHandle    c2c_plan;

    cufftPlan1d(&c2c_plan, LENGTH, CUFFT_C2C, 1);    // Creates plan used for forward and inverse transforms

    // Allocates and initializes the complex GPU array
    __CUDA_CALL(cudaMalloc(&complex, LENGTH * sizeof(cufftComplex)));
    __CUDA_CALL(cudaMemset(complex, 0, LENGTH * sizeof(cufftComplex)));

    for(int i = 0; i < LENGTH; i++)		// Initializes the input array in CPU
        input[i] = sin(i * PI * 2 * 0.01);

    // Interleaved copy of the float input to the real part of the complex array, imaginary is already 0
    float *tmp_d = (float *) complex;
    __CUDA_CALL(cudaMemcpy2D(tmp_d, 2 * sizeof(float), input, 1 * sizeof(float), sizeof(float), LENGTH, cudaMemcpyHostToDevice));

    cufftExecC2C(c2c_plan, complex, complex, CUFFT_FORWARD);	// Forward FFT in-place
    __CUDA_CALL(cudaDeviceSynchronize());

    Hilbert <<< G_SIZE, B_SIZE >>> (complex, CPX_LEN, LENGTH);
    __CUDA_CALL(cudaDeviceSynchronize());

    cufftExecC2C(c2c_plan, complex, complex, CUFFT_INVERSE);	// Inverse FFT in-place
    __CUDA_CALL(cudaDeviceSynchronize());

    Print_Complex <<< 1, 1 >>> (complex, LENGTH);		// Prints a few real and imag elements
    __CUDA_CALL(cudaDeviceSynchronize());

    // Interleaved copy of the imaginary part of the complex back to the host
    __CUDA_CALL(cudaMemcpy2D(output, sizeof(float), tmp_d + 1, 2 * sizeof(float), sizeof(float), LENGTH, cudaMemcpyDeviceToHost));

    for(int i = 0; i < LENGTH; i++)			// Scales the iFFT when data is back in host
        output[i] = output[i] / LENGTH;

    Write_File(input,  LENGTH, "Input.txt");
    Write_File(output, LENGTH, "Output.txt");

    cufftDestroy(c2c_plan);
    __CUDA_CALL(cudaFree(complex));

    return 0;
    }

A few lines of the complex still in device are (right column look good):

0.00 , 1
0.06 , -0
0.13 , -0
0.19 , -0
0.25 , -0
0.31 , -0
0.37 , -0
0.43 , -0
0.48 , -0
0.54 , -0
0.59 , -0
0.64 , -0
0.68 , -0
0.73 , -0
0.77 , -0
0.81 , -0
0.84 , -0
0.88 , -0
0.90 , -0
0.93 , -0
0.95 , -0
0.97 , -0
0.98 , 0
0.99 , 0
1.00 , 0
1.00 , 0
1.00 , 0
0.99 , 0
0.98 , 0
0.97 , 0
0.95 , 0
0.93 , 1
0.90 , 1
0.88 , 1

After the copy in line #93, the imaginary part is (Output.txt):

0.568072
-0.101907
-0.0999114
-0.309304
-0.293624
-0.403848
-0.378174
-0.44176
-0.407678
-0.442566
-0.401106
-0.415802
-0.367827
-0.36725
-0.313591
-0.301022
-0.242525
-0.220411
-0.157959
-0.128285
-0.0628
-0.0272806
0.0402823
0.0801075
0.148767
0.191475
0.260237
0.304494
0.372365
0.416916
0.482918
0.52658
0.589763
0.63143

I double-checked the copies against https://stackoverflow.com/questions/20806490/how-to-get-the-real-and-imaginary-parts-of-a-complex-matrix-separately-in-cuda , so I really have no idea of where the error could be.

First of all, your posted code won’t compile. This can’t possibly be correct:

const int	LENGTH	= 1024,
    		CPX_LEN	= LENGTH / 2 + 1,
		B_SIZE	= 256,
                G_SIZE	= (LENGTH + B_SIZE - 1) / B_SIZE,
    const float PI	= 3.141592654f;

Fixing that, in your kernel, why are you printing out the complex value with zero decimal places:

printf("%.2f , %.0f\n", dev_complex[i].x / LENGTH, dev_complex[i].y / LENGTH);
               ^^^^

According to my testing that is the source of the discrepancy:

$ cat t381.cu
#include <iostream>     // Standard I/O
#include <fstream>      // For file writing
#include <cmath>        // For the sin() function
#include <cufft.h>

#ifndef __CUDA_CALL
    #define __CUDA_CALL(call) do {\
        cudaError_t cuda_error = call;\
        if(cuda_error != cudaSuccess) {\
            std::cerr << "CUDA Error: " << cudaGetErrorString(cuda_error) << ", " << __FILE__ << ", line " << __LINE__ << std::endl;\
        return -1;}\
        } while(0)
#endif

using namespace std;

// Helper function to write input and output data
__host__ void Write_File(float *output, const int LENGTH, const char *file_name)
    {
    ofstream out_file(file_name);
    for(int i = 0; i < LENGTH; i++)
        out_file << output[i] << endl;

    out_file.close();
    }

__global__ void Hilbert(cufftComplex *dev_complex, const int CPX_LEN, const int LENGTH)
    {
    int tid     = blockDim.x * blockIdx.x + threadIdx.x + 1,    // Skips the first element (DC component)
        offset  = gridDim.x * blockDim.x;

    while(tid < LENGTH)
        {
        if(tid < CPX_LEN - 1)
            {
            dev_complex[tid].x *= 2;
            dev_complex[tid].y *= 2;
            }
        if(tid > CPX_LEN - 1)
            {
            dev_complex[tid].x = 0.0f;
            dev_complex[tid].y = 0.0f;
            }
        tid += offset;
        }
    }

__global__ void Print_Complex(cufftComplex *dev_complex, const int LENGTH)
    {
    printf("on GPU: \n");
    for(int i = 0; i < 5; i++)
        printf("%f\n", dev_complex[i].y / LENGTH);
    }

int main(void)
    {
    const int   LENGTH  = 1024,
                CPX_LEN = LENGTH / 2 + 1,
                B_SIZE  = 256,
                G_SIZE  = (LENGTH + B_SIZE - 1) / B_SIZE;
    const float PI      = 3.141592654f;

    float          input[LENGTH],
                   output[LENGTH];
    cufftComplex   *complex;
    cufftHandle    c2c_plan;

    cufftPlan1d(&c2c_plan, LENGTH, CUFFT_C2C, 1);    // Creates plan used for forward and inverse transforms

    // Allocates and initializes the complex GPU array
    __CUDA_CALL(cudaMalloc(&complex, LENGTH * sizeof(cufftComplex)));
    __CUDA_CALL(cudaMemset(complex, 0, LENGTH * sizeof(cufftComplex)));

    for(int i = 0; i < LENGTH; i++)             // Initializes the input array in CPU
        input[i] = sin(i * PI * 2 * 0.01);

    // Interleaved copy of the float input to the real part of the complex array, imaginary is already 0
    float *tmp_d = (float *) complex;
    __CUDA_CALL(cudaMemcpy2D(tmp_d, 2 * sizeof(float), input, 1 * sizeof(float), sizeof(float), LENGTH, cudaMemcpyHostToDevice));

    cufftExecC2C(c2c_plan, complex, complex, CUFFT_FORWARD);    // Forward FFT in-place
    __CUDA_CALL(cudaDeviceSynchronize());

    Hilbert <<< G_SIZE, B_SIZE >>> (complex, CPX_LEN, LENGTH);
    __CUDA_CALL(cudaDeviceSynchronize());

    cufftExecC2C(c2c_plan, complex, complex, CUFFT_INVERSE);    // Inverse FFT in-place
    __CUDA_CALL(cudaDeviceSynchronize());

    Print_Complex <<< 1, 1 >>> (complex, LENGTH);               // Prints a few real and imag elements
    __CUDA_CALL(cudaDeviceSynchronize());

    // Interleaved copy of the imaginary part of the complex back to the host
    __CUDA_CALL(cudaMemcpy2D(output, sizeof(float), tmp_d + 1, 2 * sizeof(float), sizeof(float), LENGTH, cudaMemcpyDeviceToHost));

    for(int i = 0; i < LENGTH; i++)                     // Scales the iFFT when data is back in host
        output[i] = output[i] / LENGTH;
    std::cout << "On CPU: " << std::endl;
    for(int i = 0; i < 5; i++)
        std::cout << output[i] << std::endl;

    cufftDestroy(c2c_plan);
    __CUDA_CALL(cudaFree(complex));

    return 0;
    }
$ nvcc -o t381 t381.cu -lcufft
$ ./t381
on GPU:
0.568072
-0.101907
-0.099911
-0.309304
-0.293624
On CPU:
0.568072
-0.101907
-0.0999114
-0.309304
-0.293624
$

Apologies for the idiocy. Missing semicolon and this truncation.
It definitely works now!
Thanks for your patience and guidance.

Could I just ask why you determine the grid size as:

G_SIZE = (LENGTH + B_SIZE - 1) / B_SIZE

?
I normally use 256 threads per block and find the number of blocks according to nvvp. When it says that I am close to 100% occupancy for the largest data size I expect my program to run on, and that more blocks don’t change or show diminishing returns, that’s what I use to launch that kernel (hard-coded).

It’s CUDA 101:
https://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
(slide 44)
There’s no particular reason. You can size grids in a variety of ways, especially with grid stride loops, which is what you have. It wasn’t intended to be the focus of my posting nor did I perceive it to be the crux of, or even that relevant to, your concern. I gave it very little thought, therefore. I’m not creating a masterpiece. Do what you wish.

Thank you, sir!