Hi,

I’m using cuFFTDx to perform a convolution.

I got unreasonable results, so I tried to figure out where the problem comes from.

I commented out part of the code, simplified the process, and narrowed the problem down:

I have a data vector of 1024 complex floating-point elements.

I filled the vector with the same value, -40 + 0j, so all 1024 elements are identical.

After executing the FFT with cuFFTDx, I printed the data and found it is all zeros except the first element, which is -40960 + 0j, as expected.

But after executing the IFFT, I printed the data and found that every element is -40960 + 0j instead of -40 + 0j as I expected.

This looks like the result of applying the FFT twice rather than an FFT followed by an IFFT.
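
To double-check the arithmetic on the host, here is a minimal sketch of a naive, unnormalized DFT in plain C++. It is not cuFFTDx code, and `naive_dft` and `round_trip` are hypothetical helper names. It assumes that neither direction applies a 1/N factor, which is how cuFFT documents its transforms; I am assuming cuFFTDx behaves the same way. Under that assumption, a forward transform followed by an inverse transform returns the input scaled by N = 1024, and dividing by N recovers -40 + 0j.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT with NO 1/N scaling in either direction (assumption:
// this mirrors an unnormalized FFT library).
// sign = -1 gives the forward transform, sign = +1 the inverse.
std::vector<cplx> naive_dft(const std::vector<cplx>& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce k*j modulo n to keep the angle small and accurate.
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>((k * j) % n) /
                                 static_cast<double>(n);
            acc += in[j] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Forward transform followed by inverse transform.
// When normalize is true, every element is divided by N afterwards.
std::vector<cplx> round_trip(const std::vector<cplx>& in, bool normalize) {
    std::vector<cplx> out = naive_dft(naive_dft(in, -1), +1);
    if (normalize) {
        for (cplx& v : out) {
            v /= static_cast<double>(in.size());
        }
    }
    return out;
}
```

Calling `round_trip` on 1024 copies of -40 + 0j with `normalize = false` gives the input scaled by 1024 in every element, while `normalize = true` gives the original values back, so a missing 1/N scaling after the IFFT would be one thing to rule out.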

What can I do? Am I missing something?

Thank you in advance,

Ori

My code looks like this:

```
// Host code:
static constexpr unsigned int fft_size1 = 1024;
// Common description shared by the forward and inverse transforms
using FFT_base = decltype(Block() + Size<fft_size1>() + Type<fft_type::c2c>() + Precision<float>() +
                          ElementsPerThread<2>() + FFTsPerBlock<1>() + SM<750>());
using FFT  = decltype(FFT_base() + Direction<fft_direction::forward>());
using IFFT = decltype(FFT_base() + Direction<fft_direction::inverse>());

// Allow the kernel to use the dynamic shared memory the FFT requires
cudaFuncSetAttribute(
    my_kernel<FFT, IFFT>,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    FFT::shared_memory_size);
my_kernel<FFT, IFFT><<<GridSizeKernel, FFT::block_dim, FFT::shared_memory_size>>>(data);

// Device code:
template<class FFT, class IFFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void my_kernel(complex *data) {
    // load data to shared memory
    FFT().execute(shared_mem);   // forward transform
    __syncthreads();
    // printing data
    IFFT().execute(shared_mem);  // inverse transform
    __syncthreads();
    // printing data
}
```