What is the real memory usage of CUFFT?

My program runs on a Quadro FX 5600 with 1.5 GB of graphics memory, and I need to perform a 3D FFT over three float channels. The program runs fine with 128^3 input. However, it fails with 256^3 input, I think due to a lack of memory.

The program needs three real input channels, each of size 256^3. I use R2C to convert the data to the Fourier domain, process it, and convert it back to the spatial domain with a C2R FFT. The error occurs while CUFFT is executing. So my questions are:

  • What could cause CUFFT to fail? It runs fine with 128^3 input, and I tested an FFT with 256^3 input in a separate program and that works.
  • What is the real memory usage of these three channel FFTs? How can I calculate or measure it?

Any ideas are appreciated. Thank you.

From the author of CUFFT:

"The heuristics in CUFFT are somewhat complicated, so it’s hard to predict how much temporary storage the library will use.

There are cases where it uses none, and there are cases when it can use up to 3x the size of the transform. It depends on the transform size and the particular FFT algorithm needed for that size (and that maps best to the HW). Even an in-place FFT might use some temporary storage depending on the signal size."
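As a rough starting point, the data buffers themselves can be sized by hand. The sketch below assumes single-precision, out-of-place R2C/C2R transforms whose complex side has nx*ny*(nz/2+1) elements. It treats the "up to 3x" remark as a worst-case multiplier on the larger buffer, which is a guess at what the remark means and not documented behavior. The function names are made up for illustration.

```python
def r2c_buffer_bytes(nx, ny, nz):
    """Bytes for one real input and one complex output of a 3D R2C FFT.

    Assumes float32 input (4 bytes per element) and complex64 output
    (8 bytes per element), with the last output dimension nz // 2 + 1.
    """
    real_bytes = nx * ny * nz * 4
    complex_bytes = nx * ny * (nz // 2 + 1) * 8
    return real_bytes, complex_bytes


def estimate_bytes(nx, ny, nz, channels=1, workspace_factor=3):
    """Return (data bytes for all channels, guessed worst-case workspace).

    workspace_factor=3 comes from the "up to 3x" remark; applying it to
    the larger of the two buffers is an assumption.
    """
    real_bytes, complex_bytes = r2c_buffer_bytes(nx, ny, nz)
    data = channels * (real_bytes + complex_bytes)
    workspace = workspace_factor * max(real_bytes, complex_bytes)
    return data, workspace


MiB = 1024 * 1024
data, workspace = estimate_bytes(256, 256, 256, channels=3)
print(f"data: {data / MiB:.1f} MiB, worst-case workspace: {workspace / MiB:.1f} MiB")
```

For three 256^3 channels this prints 385.5 MiB of data and 193.5 MiB of guessed workspace. Under these assumptions that is well below 1.5 GB, so other allocations in the program, or a cause other than memory, may be worth checking.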

To be sure, you could use cuMemGetInfo() to get the amount of free memory before and after the CUFFT calls.
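One way to take that measurement is sketched below. It assumes your binding exposes the free/total query as a Python callable returning `(free_bytes, total_bytes)`; PyCUDA's `pycuda.driver.mem_get_info` has that shape. `measure_free_drop` and the `FakeDevice` stand-in are hypothetical helpers so the sketch runs without a GPU; they are not part of CUDA or CUFFT.

```python
def measure_free_drop(mem_get_info, step):
    """Run step() and return (result, bytes by which free memory dropped).

    mem_get_info: callable returning (free_bytes, total_bytes).
    step: callable doing the work to measure, for example plan creation
          or an exec call followed by a context synchronize.
    """
    free_before, _ = mem_get_info()
    result = step()
    free_after, _ = mem_get_info()
    return result, free_before - free_after


class FakeDevice:
    """Stand-in for a real device query (assumption, for illustration)."""

    def __init__(self, total):
        self.total = total
        self.used = 0

    def mem_get_info(self):
        return self.total - self.used, self.total

    def allocate(self, nbytes):
        self.used += nbytes
        return nbytes


dev = FakeDevice(total=256 * 1024 * 1024)
_, drop = measure_free_drop(dev.mem_get_info, lambda: dev.allocate(64 * 1024 * 1024))
print(drop // (1024 * 1024), "MiB held after the step")
```

Note that a before/after difference only shows memory still held when the step returns. Temporary storage that is allocated and released inside the call will not appear, so it can help to wrap plan creation and each exec call separately.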

Not sure whether this is really a memory problem. I wrote Python bindings for CUDA and CUFFT and tested them against FFTW 2.1.5 under Linux. Here are my results for two sizes:

  1. nx = 256, ny = 128, nz = 256

  2. nx = 256, ny = 256, nz = 128

Note that the space requirements for both are the same, yet 1. gives correct results, while 2. gives erroneous results (the GPU is an 8600GTS with 256 MB):

case 1:

    +------------------------+
    | Fast Fourier Transform |
    | using CUDA driver API  |
    +------------------------+
    NX = 256 NY = 128 NZ = 256 -- doComplex: False
    Megabytes needed: 64
    206 MB free out of 255 MB
    Processing time: 0.213 sec
    Gigaflops GPU : 9.07 (256 128 256)
    Error CPU initial vs GPU
    Avg and max rel error = 2.47e-07 1.39e-06
    Processing time: 0.778 sec
    Gigaflops CPU : 2.48
    Speedup GPU/CPU: 3.66
    Error CPU final vs CPU initial
    Avg and max rel error = 6.33e-08 4.77e-07

case 2:

    +------------------------+
    | Fast Fourier Transform |
    | using CUDA driver API  |
    +------------------------+
    NX = 256 NY = 256 NZ = 128 -- doComplex: False
    Megabytes needed: 64
    205 MB free out of 255 MB
    Processing time: 0.231 sec
    Gigaflops GPU : 8.35 (256 256 128)
    Error CPU initial vs GPU
    Avg and max rel error = 6.35e-03 5.10e+01
    Processing time: 0.818 sec
    Gigaflops CPU : 2.36
    Speedup GPU/CPU: 3.54
    Error CPU final vs CPU initial
    Avg and max rel error = 6.41e-08 4.81e-07

    Error CPU final vs GPU
    Avg and max rel error = 6.35e-03 5.10e+01

Correction (cut-and-paste error): the following result belongs to case 1:

    Error CPU final vs GPU
    Avg and max rel error = 2.25e-07 1.25e-06

I posted my reply to Linh Ha before reading your post. The comment about the heuristics is somewhat unsatisfactory. I would like to use CUFFT in production code, where I calculate nx, ny, nz from other data, and having things fail unpredictably is not an option.

Does this make any sense to the authors of CUFFT?

    NX   NY   NZ   C2C   Avg rel error   Max rel error
    256  256  32   no    2.06e-02        3.92e+01
    256  256  32   yes   2.10e-07        1.19e-06
    256  256  31   no    9.18e-07        8.70e-06
    256  256  33   no    3.20e-07        1.98e-06

These results do not make any sense. The error is |V - inverse_fft(fft(V))|, with the inverse FFT result scaled by 1/(nx*ny*nz).

C2C (yes/no) distinguishes complex-complex-complex (yes) from real-complex-real (no) round trips.
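For reference, the round-trip check described here can be sketched in a few lines. This is a one-dimensional, pure-Python stand-in using a naive O(n^2) DFT, not the test code behind the numbers in this thread. Both transforms are left unnormalized, so the inverse result is scaled by 1/n, the 1D analogue of 1/(nx*ny*nz). How the relative error was normalized in the original test is not stated; dividing by the largest |V| is an assumption.

```python
import cmath
import random


def dft(x, sign):
    """Naive unnormalized DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def round_trip_errors(v):
    """Return (avg, max) relative error of v versus inverse(forward(v)) / n."""
    n = len(v)
    back = [c / n for c in dft(dft(v, -1), +1)]
    scale = max(abs(a) for a in v)  # assumed normalization, see lead-in
    errs = [abs(a - b) / scale for a, b in zip(v, back)]
    return sum(errs) / n, max(errs)


random.seed(0)
v = [random.uniform(-1.0, 1.0) for _ in range(32)]
avg_err, max_err = round_trip_errors(v)
print(f"Avg and max rel error = {avg_err:.2e} {max_err:.2e}")
```

A correct transform pair should give errors near the rounding level of the working precision, as in the passing single-precision cases in the table. Values of 1e-2 or larger point to wrong results rather than rounding.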

Why should nz=32 fail for r-c-r transforms but be OK for c-c-c transforms? And why should r-c-r then not fail for oddballs like nz=31 or nz=33? The memory needed for nz=32 r-c-r is 16 MB; even with up to 3x workspace there is plenty of room left on my lowly 256 MB card, and cuMemGetInfo confirms that.

Are you sure there are no bugs in the CUFFT code?

Following up on my previous reply to your post: the problem is with synchronization.

old code:

    cuCtxSynchronize()
    if doComplex:
        cufftExecC2C(plan, d_A, d_B, CUFFT_FORWARD)
        cufftExecC2C(plan, d_B, d_A, CUFFT_INVERSE)
    else:
        cufftExecR2C(plan1, d_A, d_B)
        cufftExecC2R(plan2, d_B, d_A)
    cuCtxSynchronize()

new code:

    cuCtxSynchronize()
    if doComplex:
        cufftExecC2C(plan, d_A, d_B, CUFFT_FORWARD)
        cuCtxSynchronize()  # added: wait for the forward transform
        cufftExecC2C(plan, d_B, d_A, CUFFT_INVERSE)
    else:
        cufftExecR2C(plan1, d_A, d_B)
        cuCtxSynchronize()  # added: wait for the forward transform
        cufftExecC2R(plan2, d_B, d_A)
    cuCtxSynchronize()

Once cuCtxSynchronize is added after a cufftExec call, everything works fine. For example, (256,256,15) previously gave an execute error; it no longer does. Maybe this should be mentioned in the documentation, if it really matters as much as it seems to.