cufftExecR2C only gives half the answer..?!

Hello everyone,

I've been experimenting with cuFFT.

I have an array of 256 numbers, which I load onto the device into a cufftReal array, d_rF.

I also have a cufftComplex array, d_F, of the same length.

[codebox]
.
.
.
float* h_Rdata = (float*)malloc(sizeof(float) * NX);
Complex* h_data = (Complex*)malloc(sizeof(Complex) * NX);
.
.
.
cutilSafeCall(cudaMemcpy(h_Rdata, d_rF, sizeof(float) * NX, cudaMemcpyDeviceToHost));

for (int x = 0; x < 256; x++)
{
    cout << x << " " << h_Rdata[x] << endl;
}

// Create a 1D FFT plan.
cufftPlan1d(&plan, NX, CUFFT_R2C, 1);

// Use the CUFFT plan to transform the signal (out of place: d_rF -> d_F).
cufftExecR2C(plan, d_rF, d_F);

cutilSafeCall(cudaMemcpy(h_data, d_F, sizeof(Complex) * NX, cudaMemcpyDeviceToHost));

for (int x = 0; x < 256; x++)
{
    cout << x << " " << h_data[x].x << " " << h_data[x].y << endl;
}
[/codebox]

This is the weird output:

0 1.62309

1 1.59654

2 1.78235

3 1.52576

4 1.5877

5 1.49921

6 1.46382

7 1.36648

8 1.5346

9 1.50806

10 1.50806

11 1.2957

12 1.45497

13 1.57884

14 1.47266

15 1.44612

16 1.47266

17 1.37534

18 1.25146

19 1.43727

20 1.04795

21 1.22491

22 1.26916

23 1.18952

24 1.32225

25 1.28685

26 0.906373

27 1.23376

28 1.23376

29 1.18952

30 1.10104

31 1.0391

32 1.04795

33 0.941772

34 1.0214

35 0.844436

36 1.00371

37 0.82674

38 0.72056

39 0.809045

40 0.79135

41 0.702865

42 0.525904

43 0.72056

44 0.817897

45 0.649779

46 0.596685

47 0.525904

48 0.658623

49 0.649779

50 0.481662

51 0.676318

52 0.543599

53 0.47281

54 0.552443

55 0.410872

56 0.463967

57 0.463967

58 0.348935

59 0.348935

60 0.419724

61 0.145427

62 0.233911

63 0.446263

64 0.402029

65 0.384325

66 0.216216

67 0.534748

68 0.384325

69 0.171974

70 0.180817

71 0.348935

72 0.189669

73 0.295849

74 0.233911

75 0.145427

76 0.0657939

77 0.145427

78 -0.0315426

79 0.198513

80 0.163122

81 0.145427

82 0.030395

83 0.030395

84 -0.252746

85 -0.0757849

86 0.0834892

87 0.0215516

88 0.251607

89 -0.040386

90 0.15427

91 -0.146566

92 -0.128871

93 -0.0492379

94 0.00385633

95 -0.120019

96 -0.288136

97 -0.128871

98 -0.120019

99 -0.279293

100 -0.146566

101 -0.146566

102 -0.146566

103 -0.0138389

104 -0.429715

105 -0.296988

106 -0.491653

107 -0.137723

108 -0.44741

109 -0.341231

110 -0.394316

111 -0.376621

112 -0.164261

113 -0.341231

114 -0.226199

115 -0.217355

116 -0.181956

117 -0.270441

118 -0.341231

119 -0.544739

120 -0.332379

121 -0.367778

122 -0.367778

123 -0.606676

124 -0.394316

125 -0.527043

126 -0.473949

127 -0.438559

128 -0.535887

129 -0.429715

130 -0.571286

131 -0.668614

132 -0.55359

133 -0.704004

134 -0.668614

135 -0.704004

136 -0.757098

137 -0.907521

138 -0.677466

139 -0.942911

140 -0.951763

141 -1.06679

142 -0.730551

143 -0.898669

144 -0.801341

145 -0.907521

146 -0.704004

147 -0.951763

148 -1.18182

149 -0.951763

150 -0.925216

151 -1.06679

152 -0.695161

153 -0.942911

154 -1.00485

155 -0.898669

156 -1.12872

157 -0.934059

158 -1.04909

159 -0.872122

160 -1.11987

161 -1.11987

162 -0.898669

163 -1.04909

164 -1.37647

165 -1.24376

166 -1.10218

167 -1.2526

168 -0.995997

169 -1.11987

170 -1.11987

171 -1.19951

172 -1.31454

173 -1.18182

174 -1.34993

175 -1.41186

176 -1.27915

177 -1.31454

178 -1.36763

179 -1.38533

180 -1.14642

181 -1.41186

182 -1.48265

183 -1.2349

184 -1.29684

185 -1.41186

186 -1.47381

187 -1.37647

188 -1.49151

189 -1.33223

190 -1.46496

191 -1.34108

192 -1.45611

193 -1.39417

194 -1.55344

195 -1.15527

196 -1.24376

197 -1.26145

198 -1.32339

199 -1.35878

200 -1.27915

201 -1.19951

202 -1.09333

203 -1.12872

204 -1.12872

205 -0.783646

206 -1.06679

207 -1.15527

208 -0.880974

209 -0.995997

210 -0.889817

211 -0.951763

212 -0.942911

213 -0.748247

214 -0.597824

215 -0.748247

216 -0.491653

217 -0.438559

218 -0.491653

219 -0.473949

220 -0.385473

221 -0.0226908

222 -0.0669331

223 -0.111175

224 -0.0669331

225 0.216216

226 -0.208504

227 0.180817

228 0.242755

229 0.180817

230 0.136575

231 0.22506

232 0.552443

233 0.570147

234 0.623232

235 0.658623

236 0.658623

237 0.835592

238 1.39303

239 1.37534

240 1.41073

241 1.38419

242 1.66733

243 1.63194

244 2.12744

245 1.62309

246 1.94163

247 2.04781

248 2.27786

249 2.09205

250 2.10974

251 1.96817

252 2.22477

253 2.12744

254 2.12744

255 2.3044

0 0.013223 0

1 125.589 -97.4914

2 73.5993 2.05362

3 27.3758 14.4973

4 9.83228 12.6541

5 4.14769 6.37364

6 4.24651 11.6731

7 0.0536612 7.13867

8 1.26574 8.52379

9 -1.65793 4.35206

10 -1.63675 5.52411

11 -2.1299 2.90129

12 -3.42531 -0.483899

13 -2.56707 1.19902

14 0.613707 0.74622

15 2.0512 2.65198

16 1.28713 0.761634

17 -0.0389231 2.32018

18 1.90128 0.542881

19 -1.1291 2.20142

20 2.30096 -0.85234

21 -0.353202 0.25511

22 -1.20106 2.26068

23 0.7145 -1.16371

24 -0.593511 -0.199678

25 0.0715992 0.746205

26 -0.758886 0.812164

27 -0.30864 0.616363

28 -1.16105 1.16024

29 0.545582 -2.17984

30 1.54297 1.04905

31 -0.525643 1.7628

32 2.63921 0.322476

33 -1.01197 1.25928

34 -0.148713 1.0354

35 2.57134 2.2437

36 -0.879272 -1.65674

37 -1.90698 0.769346

38 0.772566 -0.0143051

39 -1.141 0.306798

40 -1.5687 2.21978

41 -0.964771 -0.656651

42 -0.391873 -0.401092

43 0.108266 -0.33612

44 1.589 1.83535

45 -0.066745 0.76091

46 0.622943 0.612212

47 -0.448492 2.13539

48 -2.92175 1.59187

49 -2.8068 0.466165

50 -1.96576 1.58289

51 -2.56866 0.416314

52 -0.403031 1.77096

53 -0.270214 -1.23578

54 -0.73457 -1.5027

55 1.00639 -0.320103

56 -0.275896 -0.121656

57 1.98626 0.74013

58 0.150748 -2.23046

59 -0.242996 0.440404

60 2.16056 -0.806757

61 -0.221635 2.14119

62 -1.95433 0.571309

63 0.486124 0.365236

64 1.0441 -0.106205

65 2.76061 -1.07172

66 -0.706307 -0.459384

67 -2.15571 0.93468

68 -0.0629404 3.06563

69 -2.59316 0.233405

70 -1.35508 -1.16166

71 0.281918 0.100832

72 0.387475 1.14622

73 -1.52258 -0.399315

74 1.72536 0.418018

75 -2.12741 3.01702

76 -0.199979 0.758471

77 -0.348014 1.11992

78 -3.70223 -0.103214

79 -0.85485 0.408197

80 0.00513601 0.327361

81 1.19659 -0.00268471

82 -0.00901607 1.52935

83 -0.709262 -0.264943

84 2.46532 -0.917811

85 0.960356 0.0340135

86 -1.48646 0.522672

87 0.0822357 -0.47101

88 -2.91026 -0.279893

89 -0.546254 2.30899

90 -0.0820942 1.12444

91 1.09606 1.9527

92 -2.9062 2.74379

93 -0.792652 2.02758

94 -1.00118 1.45457

95 -0.422203 -2.01798

96 1.11249 -0.597761

97 -1.3728 0.6111

98 -0.90328 -0.11501

99 0.146492 1.10298

100 -1.46592 -1.03234

101 -0.174138 -0.300478

102 -1.25809 2.31031

103 -0.89533 -0.576144

104 -1.87388 1.17083

105 2.50249 -0.581984

106 0.221582 -0.280879

107 0.273538 0.512227

108 0.466491 -1.73086

109 -1.75707 1.89075

110 0.0494137 0.620006

111 0.575393 0.990245

112 -2.05137 1.12522

113 -1.21459 -0.37617

114 0.0487164 -0.142939

115 -0.353316 -0.392413

116 -1.91559 0.352213

117 0.294271 1.14546

118 -3.17852 0.559147

119 0.469072 3.18798

120 -2.21722 -1.4865

121 -1.23593 -1.70163

122 2.38779 -0.255282

123 0.384482 -2.04733

124 0.966435 -0.85213

125 -0.0983543 -1.85198

126 3.11217 0.201201

127 -0.0226593 -0.469227

128 0.725556 0

129 0 0

130 0 0

131 0 0

132 0 0

133 0 0

134 0 0

135 0 0

136 0 0

137 0 0

138 0 0

139 0 0

140 0 0

141 0 0

142 0 0

143 0 0

144 0 0

145 0 0

146 0 0

147 0 0

148 0 0

149 0 0

150 0 0

151 0 0

152 0 0

153 0 0

154 0 0

155 0 0

156 0 0

157 0 0

158 0 0

159 0 0

160 0 0

161 0 0

162 0 0

163 0 0

164 0 0

165 0 0

166 0 0

167 0 0

168 0 0

169 0 0

170 0 0

171 0 0

172 0 0

173 0 0

174 0 0

175 0 0

176 0 0

177 0 0

178 0 0

179 0 0

180 0 0

181 0 0

182 0 0

183 0 0

184 0 0

185 0 0

186 0 0

187 0 0

188 0 0

189 0 0

190 0 0

191 0 0

192 0 0

193 0 0

194 0 0

195 0 0

196 0 0

197 0 0

198 0 0

199 0 0

200 0 0

201 0 0

202 0 0

203 0 0

204 0 0

205 0 0

206 0 0

207 0 0

208 0 0

209 0 0

210 0 0

211 0 0

212 0 0

213 0 0

214 0 0

215 0 0

216 0 0

217 0 0

218 0 0

219 0 0

220 0 0

221 0 0

222 0 0

223 0 0

224 0 0

225 0 0

226 0 0

227 0 0

228 0 0

229 0 0

230 0 0

231 0 0

232 0 0

233 0 0

234 0 0

235 0 0

236 0 0

237 0 0

238 0 0

239 0 0

240 0 0

241 0 0

242 0 0

243 0 0

244 0 0

245 0 0

246 0 0

247 0 0

248 0 0

249 0 0

250 0 0

251 0 0

252 0 0

253 0 0

254 0 0

255 0 0

So, can anyone explain why I only get the first half of the answers?

It is explained in the CUFFT manual, page 3.
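In other words: for an NX-point real-to-complex transform, CUFFT writes only the first NX/2 + 1 complex coefficients, because every bin above that is just the complex conjugate of a lower one. Here is a minimal sketch of how the output buffer and the copy could be sized for the 1D case above (assuming NX = 256 and reusing d_rF from the post; the other names here are only illustrative):

[codebox]
// Sketch only: assumes NX = 256 and that d_rF already holds the NX real inputs on the device.
const int NC = NX / 2 + 1;                 // number of non-redundant complex outputs

cufftComplex* d_spec;
cudaMalloc((void**)&d_spec, sizeof(cufftComplex) * NC);

cufftHandle plan;
cufftPlan1d(&plan, NX, CUFFT_R2C, 1);
cufftExecR2C(plan, d_rF, d_spec);

cufftComplex* h_spec = (cufftComplex*)malloc(sizeof(cufftComplex) * NC);
cudaMemcpy(h_spec, d_spec, sizeof(cufftComplex) * NC, cudaMemcpyDeviceToHost);

// If the full NX-bin spectrum is needed, rebuild the upper half on the host
// from the conjugate symmetry X[k] = conj(X[NX - k]):
cufftComplex* h_full = (cufftComplex*)malloc(sizeof(cufftComplex) * NX);
for (int k = 0; k < NX; k++)
{
    if (k < NC)
    {
        h_full[k] = h_spec[k];               // stored directly by CUFFT
    }
    else
    {
        h_full[k].x =  h_spec[NX - k].x;     // conjugate of the mirrored bin
        h_full[k].y = -h_spec[NX - k].y;
    }
}
[/codebox]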


================

I'm posting this in the hope that it will save some other people time. I am a programmer who needed to use FFTs in CUDA, and I figured a lot of things out along the way.

First, some sample code, then an explanation. It is not the prettiest thing, but it should get the point across and give you something to start from quickly.

#define IMAGE_DIM 256

dim3 grid(IMAGE_DIM / 16, IMAGE_DIM / 16, 1);
dim3 threads(16, 16, 1);

// Declare handles to the FFT plans
cufftHandle forwardFFTPlan;
cufftHandle inverseFFTPlan;

// Create the plans -- forward and inverse (Real2Complex, Complex2Real)
CUFFT_SAFE_CALL( cufftPlan2d(&forwardFFTPlan, IMAGE_DIM, IMAGE_DIM, CUFFT_R2C) );
CUFFT_SAFE_CALL( cufftPlan2d(&inverseFFTPlan, IMAGE_DIM, IMAGE_DIM, CUFFT_C2R) );

int num_real_elements    = IMAGE_DIM * IMAGE_DIM;
int num_complex_elements = IMAGE_DIM * (IMAGE_DIM / 2 + 1);

// HOST MEMORY
float *h_img;
cufftComplex *h_complex_imgSpec;
float *h_imgF;

// ALLOCATE HOST MEMORY
h_img             = (float*)malloc(num_real_elements * sizeof(float));
h_complex_imgSpec = (cufftComplex*)malloc(num_complex_elements * sizeof(cufftComplex));
h_imgF            = (float*)malloc(num_real_elements * sizeof(float));

for (int x = 0; x < IMAGE_DIM; x++)
{
   for (int y = 0; y < IMAGE_DIM; y++)
   {
      // initialize the input image memory somehow
      // (this probably comes from a file or image buffer or something)
      h_img[y * IMAGE_DIM + x] = 0.0f;
   }
}

// DEVICE MEMORY
float *d_img;
cufftComplex *d_complex_imgSpec;
float *d_imgF;

// ALLOCATE DEVICE MEMORY
CUDA_SAFE_CALL( cudaMalloc((void**)&d_img,             num_real_elements    * sizeof(float)) );
CUDA_SAFE_CALL( cudaMalloc((void**)&d_complex_imgSpec, num_complex_elements * sizeof(cufftComplex)) );
CUDA_SAFE_CALL( cudaMalloc((void**)&d_imgF,            num_real_elements    * sizeof(float)) );

// copy host memory to device (input image)
CUDA_SAFE_CALL( cudaMemcpy(d_img, h_img, num_real_elements * sizeof(float), cudaMemcpyHostToDevice) );

// now run the forward FFT on the device (real to complex)
CUFFT_SAFE_CALL( cufftExecR2C(forwardFFTPlan, d_img, d_complex_imgSpec) );

// copy the DEVICE complex data to the HOST
// NOTE: we are only doing this so that you can see the data -- in general you want
// to do your computation on the GPU without wasting the time of copying it back to the host
CUDA_SAFE_CALL( cudaMemcpy(h_complex_imgSpec, d_complex_imgSpec, num_complex_elements * sizeof(cufftComplex), cudaMemcpyDeviceToHost) );

// print the complex data so you can see what it looks like
// (note that the transformed dimension is only IMAGE_DIM/2 + 1 wide)
for (int x = 0; x < (IMAGE_DIM / 2 + 1); x++)
{
   for (int y = 0; y < IMAGE_DIM; y++)
   {
      printf("h_complex_imgSpec[%d,%d] = %f + %fi\n", x, y,
             h_complex_imgSpec[y * (IMAGE_DIM / 2 + 1) + x].x,
             h_complex_imgSpec[y * (IMAGE_DIM / 2 + 1) + x].y);
   }
}

// here you can modify or filter the data in the frequency domain
// TODO: insert your filter code here, or whatever
// NOTE: you can/should modify it on the GPU/DEVICE, not the HOST
// IF you modify it on the HOST you will need to cudaMemcpy it back to the DEVICE

// now run the inverse FFT on the device (complex to real)
CUFFT_SAFE_CALL( cufftExecC2R(inverseFFTPlan, d_complex_imgSpec, d_imgF) );

// NOTE: the data in d_imgF is not normalized at this point.
// Normalize the data in place -- IFFT normalization means dividing every element
// by the total number of elements in the matrix/image/array (i.e., the number of pixels)
NormalizeIFFT<<< grid, threads >>>(d_imgF, IMAGE_DIM, IMAGE_DIM, (float)(IMAGE_DIM * IMAGE_DIM));

// Copy the DEVICE memory to the HOST memory
CUDA_SAFE_CALL( cudaMemcpy(h_imgF, d_imgF, num_real_elements * sizeof(float), cudaMemcpyDeviceToHost) );

// print the elements of the resulting data
for (int i = 0; i < num_real_elements; i++)
{
   printf("h_imgF[%d] = %f\n", i, h_imgF[i]);
}

// CLEANUP HOST MEMORY
free(h_img);
free(h_complex_imgSpec);
free(h_imgF);

// CLEANUP DEVICE MEMORY
CUDA_SAFE_CALL( cudaFree(d_img) );
CUDA_SAFE_CALL( cudaFree(d_complex_imgSpec) );
CUDA_SAFE_CALL( cudaFree(d_imgF) );

// CLEANUP THE PLANS
CUFFT_SAFE_CALL( cufftDestroy(forwardFFTPlan) );
CUFFT_SAFE_CALL( cufftDestroy(inverseFFTPlan) );
// This is GPU code -- the normalization kernel launched above
__global__ void NormalizeIFFT(float *g_data, int width, int height, float N)
{
   // index = y * width + x
   unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
   unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

   if (xIndex < width && yIndex < height)   // guard in case the grid overshoots the image
   {
      int index = yIndex * width + xIndex;
      g_data[index] = g_data[index] / N;
   }
}
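A quick sanity check I would suggest (not part of the sample above): after the forward transform, the inverse transform and the normalization, h_imgF should match the original h_img to within single-precision rounding error. A small sketch, reusing the names from the code above (needs <math.h> for fabsf):

[codebox]
// Sketch only: round-trip check, reusing h_img, h_imgF and num_real_elements from above.
float maxErr = 0.0f;
for (int i = 0; i < num_real_elements; i++)
{
   float err = fabsf(h_imgF[i] - h_img[i]);
   if (err > maxErr)
      maxErr = err;
}
printf("max round-trip error = %g\n", maxErr);   // should be tiny (rounding error only)
[/codebox]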

If you find problems, please post them and I’ll edit this post.

Explanation

FFT stands for Fast Fourier Transform. When you FFT an image, the input data is real (i.e., a single float per pixel) and the output data is complex (i.e., each value has a real and an imaginary part). The output from the CUFFT plans is of type cufftComplex, which stores the real part in .x and the imaginary part in .y.
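For example, if all you want on the host is the magnitude (and phase) of each coefficient, it comes straight from those two fields. A tiny sketch; h_spec and n are illustrative names for any copied-back cufftComplex buffer and its length, not names from the code above (needs <math.h> and <stdio.h>):

[codebox]
// Sketch only: h_spec / n are illustrative (a host cufftComplex array and its length).
for (int k = 0; k < n; k++)
{
   float re  = h_spec[k].x;                  // real part
   float im  = h_spec[k].y;                  // imaginary part
   float mag = sqrtf(re * re + im * im);     // magnitude
   float ph  = atan2f(im, re);               // phase (radians)
   printf("bin %d: %f + %fi  (mag %f, phase %f)\n", k, re, im, mag, ph);
}
[/codebox]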

The reason the complex output is only "about half" the size of the real input is that the FFT of real data contains a lot of redundancy: the transform is conjugate-symmetric, so every coefficient in the "upper half" is just the complex conjugate of one in the "lower half". The CUDA CUFFT Library manual describes the output as holding only the "non-redundant complex coefficients."

The redundancy is as follows:

The CUDA output of the FFT of a 256x256 image is 129x256 (129 columns wide, i.e. WIDTH/2 + 1). In MATLAB, the result of calling fft2 on the same data would be 256x256. The "left side" of the MATLAB fft data matches the entire CUDA output. The right half is obtained from the left half by mirroring about the center column, flipping vertically (except for row zero), and negating the sign of the imaginary part (i.e., taking the complex conjugate). Row zero and column zero of the result are special: they do not take part in that overall flip, but each is mirrored within itself, row zero about its own center column and column zero about its own center row.

That is hard to describe in words, so here's a short example.

Consider an 8x8 FFT output (from MATLAB, for example), where each entry is labelled by its row and column:

11 12 13 14 15 16 17 18

21 22 23 24 25 26 27 28

31 32 33 34 35 36 37 38

41 42 43 44 45 46 47 48

51 52 53 54 55 56 57 58

61 62 63 64 65 66 67 68

71 72 73 74 75 76 77 78

81 82 83 84 85 86 87 88

In this data, you will find that:

  • ROW 1 is special, COLUMN 1 is special
  • 11 is unique / not repeated anywhere
  • 12 is the complex conjugate of 18
    For example, if 12 is (+2.7 + 1.9i), 18 will be (+2.7 - 1.9i)
    NOTE how only the sign of the imaginary part changes
  • 13 is the conjugate of 17
  • 14 is the conjugate of 16
  • 15 is unique / not repeated anywhere
  • The area from 22 to 88 is mirrored and flipped about the column with 15 at its head:
  • 28 is the conjugate of 82
  • 47 is the conjugate of 63
  • 88 is the conjugate of 22
  • 23 is the conjugate of 87
  • NOTE that the column with 15 at its head is mirrored with itself, so
  • 25 is the conjugate of 85
  • 35 is the conjugate of 75
  • 45 is the conjugate of 65
  • 55 is unique, and is in a sense the "mirror point"

The CUDA output of the same FFT will look like:

11 12 13 14 15

21 22 23 24 25

31 32 33 34 35

41 42 43 44 45

51 52 53 54 55

61 62 63 64 65

71 72 73 74 75

81 82 83 84 85

This eliminates the need for the FFT library to calculate the redundant data.
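If you ever need the full, MATLAB-sized matrix on the host (say, to compare against fft2), you can rebuild the missing columns from the conjugate symmetry described above. A rough sketch for the 256x256 case, assuming a host array half laid out exactly like h_complex_imgSpec in the sample (IMAGE_DIM rows of IMAGE_DIM/2 + 1 complex values each) and a destination buffer full of IMAGE_DIM x IMAGE_DIM complex values; both names exist only for this sketch:

[codebox]
// Sketch only: 'half' and 'full' are illustrative names, not from the sample code.
// half : IMAGE_DIM rows x (IMAGE_DIM/2 + 1) cols, as produced by cufftExecR2C
// full : IMAGE_DIM rows x IMAGE_DIM cols, rebuilt via X[r][c] = conj(X[(N-r)%N][(N-c)%N])
const int N  = IMAGE_DIM;
const int HW = N / 2 + 1;            // stored (half) width

for (int r = 0; r < N; r++)
{
   for (int c = 0; c < N; c++)
   {
      if (c < HW)
      {
         full[r * N + c] = half[r * HW + c];                     // kept by CUFFT
      }
      else
      {
         cufftComplex src = half[((N - r) % N) * HW + (N - c)];  // its conjugate partner
         full[r * N + c].x =  src.x;
         full[r * N + c].y = -src.y;                             // conjugate: flip imaginary sign
      }
   }
}
[/codebox]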

As far as I know, this applies to square FFTs; I'm not 100% sure exactly what happens if your FFT is not square, but it should be similar.

I hope this helps somebody. Comments are welcome and appreciated.