CUDA FFT different from Matlab FFT

Hi!

I’m porting a Matlab application to CUDA. I need to compute FFTs with the cuFFT library, but the results from Matlab’s fft() and from cuFFT differ. I’ve seen on this forum ( http://forums.nvidia.com/index.php?showtop…mp;#entry589016 ) and elsewhere that the problem lies in the different matrix storage (row-major order in CUDA vs. column-major order in Matlab) and in the different representation of complex values (in CUDA an interleaved array of structs, while in Matlab a struct of arrays). The Matlab instruction that I’m trying to convert:
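To make the layout difference concrete, here is a small host-side sketch (plain C, helper names are mine, not from the original code) of how the same logical element (i, j) maps to different linear indices under the two conventions:

```c
#include <assert.h>

/* Matlab stores an R-by-C matrix column-major: element (i, j)
   (0-based) lives at linear index i + j * R.
   C/CUDA code usually stores it row-major: i * C + j.
   These helpers make the mapping explicit. */
int idx_col_major(int i, int j, int rows) { return i + j * rows; }
int idx_row_major(int i, int j, int cols) { return i * cols + j; }
```

So handing a buffer copied straight out of Matlab to code that assumes row-major order effectively transposes the matrix.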

% A is a 1013*256-matrix

% B will be a 1024*256-matrix

B = fft(A, 1024, 1);

So I’ve made these trials:

  • n-point cuFFT on A in one dimension
cufftHandle plan;

cufftPlan1d(&plan,1024*256,CUFFT_C2C,1);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);
  • 2D cuFFT on A
cufftHandle plan;

cufftPlan2d(&plan,1024,256,CUFFT_C2C);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);
  • n-point cuFFT on A^T (the transpose of A) in one dimension
cufftHandle plan;

cufftPlan1d(&plan,1024*256,CUFFT_C2C,1);

cufftExecC2C(plan,At,B,CUFFT_FORWARD);
  • 2D cuFFT on A with the N and M arguments swapped
cufftHandle plan;

cufftPlan2d(&plan,256,1024,CUFFT_C2C);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);

None of these gave me results even close to Matlab’s fft(). What can I do?
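For reference, Matlab’s fft(A, 1024, 1) performs an independent 1024-point FFT down each of the 256 columns, zero-padding each column from 1013 to 1024 rows. In cuFFT terms that corresponds to a batched 1D plan of size 1024 with batch count 256, not one large 1D transform and not a 2D transform. A minimal sketch, assuming the data is stored column-major as Matlab has it, so that each (padded) column is contiguous:

```cuda
// Sketch of a cuFFT equivalent of Matlab's B = fft(A, 1024, 1).
// Assumes dA holds A column-major, with each column zero-padded
// from 1013 to 1024 elements so every column occupies one
// contiguous 1024-element slot.
cufftComplex *dA, *dB;
cudaMalloc((void **) &dA, 1024 * 256 * sizeof(cufftComplex));
cudaMemset(dA, 0, 1024 * 256 * sizeof(cufftComplex));  // zero-pad before copying the data in
/* ... copy the 1013 rows of each column into the first 1013 slots of its 1024-element column ... */
cudaMalloc((void **) &dB, 1024 * 256 * sizeof(cufftComplex));

cufftHandle plan;
cufftPlan1d(&plan, 1024, CUFFT_C2C, 256);  // 1024-point transform, batch of 256 columns
cufftExecC2C(plan, dA, dB, CUFFT_FORWARD);
cufftDestroy(plan);
```

Note the plan size is 1024 (the transform length), while 256 is the batch count; passing 1024*256 as the size computes one giant transform over the whole buffer, which is a different operation.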

If you divide the GPU FFT results by the size of the FFT, the results should match between Matlab and CUDA.

Ok, I will try. Thanks. ;)

I’ve tried, but without success. Which of the above transforms should I apply? And do I need to pass the transpose of A? The FFT size is 1024*256, right?

Look at the code in http://developer.download.nvidia.com/compu…ab_Cuda_1.1.tgz.

You need to use the 2D transform, by the way.

I’ve already seen that, but I don’t know how to adjust it for my program. Recapitulating, the situation is the following:

  • I’ve defined some variables: Nblock = 1014, Ncell = 256

  • Initially, I launch a kernel that outputs an (Nblock-1)-by-Ncell matrix (really, a flattened cufftComplex array of (Nblock-1)*Ncell elements). This array, named snap_shot, has the same values as the one produced by Matlab (where it’s just an (Nblock-1)-by-Ncell matrix, not a flattened array). So there’s no need to “pack” complex values as in the Matlab_Cuda_1.1 plugin, right?

  • After the kernel, I need to run an FFT on this array. In Matlab the following instruction is used (producing a 1024-by-256 matrix):

temp_fft = fft(snap_shot, 2^nextpow2(Nblock), 1);   % 2^nextpow2(Nblock)==1024
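As an aside, the 2^nextpow2(Nblock) expression can be reproduced on the host with a small helper (plain C; the name is mine, not from the original code):

```c
#include <assert.h>

/* Smallest power of two >= n, i.e. Matlab's 2^nextpow2(n) for n >= 1. */
int next_pow2(int n) {
    int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```

With Nblock = 1014 this gives next_pow2(1014) == 1024, matching the Matlab call above.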

Instead, in CUDA I’ve launched one of the following cuFFT variants:

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, Ncell*nxtPow2Nblock*sizeof(cufftComplex));

...

cufftHandle plan;

cufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

or

cufftHandle plan;

cufftPlan1d(&plan,Ncell*nxtPow2Nblock,CUFFT_C2C,1);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

All of these give me different results from Matlab’s. As suggested here, I’ve also tried dividing the FFT results by the size of the FFT (which is nxtPow2Nblock*Ncell, right?); however, I still get different results from Matlab. I don’t know what to do.

You definitely have to do a 2D FFT. When you say you have different results from Matlab, what do you see?

for example: f2 = fftn(ref20); in Matlab,
CUDA equivalent:
cufftHandle plan_forward2;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward2, pix1, pix2, CUFFT_C2C)); //pix1 and pix2 are the size of the FFT in x and y dimension
CUFFT_SAFE_CALL(cufftExecC2C(plan_forward2, in2_d, f2_d, CUFFT_FORWARD));
CUFFT_SAFE_CALL(cufftDestroy(plan_forward2));

I’ve already tried that, but without success. What do you mean by “When you say you have different results with Matlab what do you see”? Simply, I see that the results from CUDA and Matlab are not equal (or even similar).

But are the results equal, considering that Matlab is double precision and CUDA is single precision?

I’ve also tried with cufftDoubleComplex type and with cufftExecZ2Z; it doesn’t seem a precision problem.

This is a snippet of my main():

int main(int argc, char *argv[]) {

...

float2 *dAmb;

cudaMalloc((void **) &dAmb, nxtPow2Nblock*Ncell*sizeof(float2));

cufftComplex *dsnap_shot;

cudaMalloc((void **) &dsnap_shot, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftHandle plan;

cufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

float2 *hAmb;

cudaMallocHost((void **) &hAmb, nxtPow2Nblock*Ncell*sizeof(float2));

...

ambfunc_kern<<<numBlocks,numThreads>>>(dhNc,dhNblock,Nblock,block_size,dsnap_shot);  // After this dsnap_shot will have correct values

cudaThreadSynchronize();

cufftExecC2C(plan,dsnap_shot,temp_fft,CUFFT_FORWARD);  // After this temp_fft has wrong values

cudaThreadSynchronize();

fftshift_kern<<<numBlocks,numThreads>>>(temp_fft,nxtPow2Nblock*Ncell,nxtPow2Nblock,block_size,dAmb);

cudaMemcpy(hAmb,dAmb,nxtPow2Nblock*Ncell*sizeof(float2),cudaMemcpyDeviceToHost);

...

}

Any other suggestions?