Hi everyone!

I’m trying to develop a parallel version of Toeplitz hashing using the FFT on the GPU, with CUFFT/CUDA.

And when I try to create a CUFFT 1D plan, I get an error that is not very descriptive (CUFFT_INTERNAL_ERROR)…

The way I create the cufftPlan1d is the following:

```
cufftResult cufft_result;
cufftHandle plan;

/* cufftPlan1d takes an int transform size; data_block_length is an
   unsigned long int, so I cast explicitly (50397139 fits in an int). */
cufft_result = cufftPlan1d( &plan, (int) data_block_length, CUFFT_Z2Z, 1 );
if( cufft_result != CUFFT_SUCCESS ) {
    /* CUFFT has no cufftGetErrorString(), so print the numeric code. */
    printf( "CUFFT Error (code %d)\n", (int) cufft_result );
    exit(-1);
}
```

The `data_block_length` (an unsigned long int) is equal to 50397139, and I’m using a BATCH of 1 in CUFFT. I verified that for some larger sizes I tried, I do not get any errors, so I’m fairly sure it is not a memory problem. In particular, an array of double-precision complex numbers of this size should need 16 (2 × 8) × 50397139 = 806354224 bytes ≈ 0.75 gigabytes, and I have a GPU with 6 gigabytes. I also did all the required steps beforehand: cudaMalloc, cudaMemcpy, etc.

However, after reading some forums and documentation, I still have doubts about a few aspects:

- It is often recommended to use batches greater than 1 in CUFFT. In that case, the transform size passed to cufftPlan1d would be `(data_block_length / BATCH)`. Is it always worthwhile to use batches for CUFFT plans?
- In some documentation and forum posts I read, a lot of people mention power-of-2 sizes. Is this a requirement? Can I never use a CUFFT plan with a size that is not a power of 2?

I also tried both of these options: zero-padding the remaining positions of the array to reach a power-of-2 size, and using batches, both with the original size and with the zero-padded power-of-2 sizes.

In these trials, I do not get any error, but I get wrong results. I also developed other versions of my program (in Python using SciPy functions to compute the Toeplitz matrices and the FFT, a serial C++ version using FFTW, and a parallel C++ version using FFTW with OpenMP), and I got different results from those versions (with an error of around 50%, which suggests the CUFFT results may be essentially random)…

Can someone help me? I’m feeling a little lost, and the documentation doesn’t always help much… :(

Thank you in advance!