Hi everyone!
I’m trying to develop a parallel version of Toeplitz Hashing using FFT on GPU, in CUFFT/CUDA.
When I try to create a cuFFT 1D plan, I get an error that is not very explicit (CUFFT_INTERNAL_ERROR)…
I create the plan with cufftPlan1d as follows:
cufftResult cufft_result;
cufftHandle plan;
cufft_result = cufftPlan1d( &plan, (int)data_block_length, CUFFT_Z2Z, 1 );
if ( cufft_result != CUFFT_SUCCESS ) {
    /* cuFFT has no cufftGetErrorString; print the numeric cufftResult instead */
    printf( "CUFFT Error (error code: %d)\n", (int)cufft_result );
    exit( -1 );
}
The data_block_length (an unsigned long int) is equal to 50397139, and I’m using a BATCH of 1 in cuFFT. I verified that for some greater sizes I tried, I do not get any errors, so I’m fairly sure it is not a memory problem. In particular, an array of double-precision complex numbers of this size should need 16 (2 x 8) x 50397139 = 806354224 bytes ≈ 0.75 gigabytes, and my GPU has 6 gigabytes. I also perform all the required setup beforehand: cudaMalloc, cudaMemcpy, etc.
However, after reading some forums and documentation, I still have doubts about a few points:
- It is often recommended to use batches greater than 1 in cuFFT. In that case I would need to pass a transform size of (data_block_length / BATCH) when I call cufftPlan1d. Is it always worthwhile to use batches for cuFFT plans?
- In some documentation and discussions I read, a lot of people mention using power-of-2 sizes. Is this a requirement? Can I never use a cuFFT plan with a size that is not a power of 2?
I also tried these two options: zero-padding the remaining positions of the array to reach a power-of-2 size, and using batches, both with the original size and with the zero-padded power-of-2 size.
In these trials I do not get any errors, but I get the wrong results. I also developed other versions of my program (in Python using SciPy functions to compute the Toeplitz matrices and the FFT, a serial C++ version using FFTW, and a parallel C++ version using FFTW with OpenMP), and the GPU results differ from all of them (by an error of around 50%, which suggests the computed results may essentially be random)…
Can someone help me? I’m feeling a little lost, and the documentation doesn’t always help much… :(
Thank you in advance!