What is the real memory usage of cudaFFT

Linh_Ha · November 12, 2007, 9:37pm

My program runs on a Quadro FX 5600 with 1.5 GB of graphics memory, and I need to perform a 3D FFT over three float channels. The program runs fine with 128^3 input. However, it fails with 256^3 input, I think due to a lack of memory.

The program needs three real input channels, each of size 256^3. I use R2C to convert the data to the Fourier domain, process it, and convert it back to the spatial domain with a C2R FFT. The error occurs while CUFFT is executing. So my questions are:

What could cause CUFFT to fail? It runs fine with 128^3 input, and I tested the FFT with 256^3 input in a separate program, where it works.
What is the real memory usage of these three-channel FFTs? How can I calculate or measure this amount of memory?

Any ideas are appreciated. Thank you.

Simon_Green · November 15, 2007, 10:06am

From the author of CUFFT:

"The heuristics in CUFFT are somewhat complicated, so it’s hard to predict how much temporary storage the library will use.

There are cases where it uses none, and there are cases when it can use up to 3x the size of the transform. It depends on the transform size and the particular FFT algorithm needed for that size (and that maps best to the HW). Even an in-place FFT might use some temporary storage depending on the signal size."

To be sure, you could use cuMemGetInfo() to get the amount of free memory before and after the CUFFT calls.
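
Before measuring, the raw buffer sizes can be estimated by hand. The following is a minimal sketch, not something taken from the CUFFT author: the helper name `fft3d_buffer_bytes` is hypothetical, the transforms are assumed to be single precision and out of place, and the "up to 3x" worst case quoted above is interpreted (as an assumption) as three times the complex buffer of one transform executing at a time.

```python
MIB = 1024 ** 2

def fft3d_buffer_bytes(nx, ny, nz):
    """Bytes for one single-precision 3D R2C/C2R buffer pair (hypothetical helper)."""
    real_bytes = nx * ny * nz * 4                # float32 real input
    complex_bytes = nx * ny * (nz // 2 + 1) * 8  # half spectrum, two float32 per element
    return real_bytes, complex_bytes

channels = 3
real_b, cplx_b = fft3d_buffer_bytes(256, 256, 256)

# User-allocated buffers for all channels (out-of-place transforms assumed).
data_bytes = channels * (real_b + cplx_b)

# Assumed worst-case scratch: 3x one channel's complex buffer,
# since only one transform executes at a time.
scratch_bytes = 3 * cplx_b

print(f"real per channel    : {real_b / MIB:.1f} MiB")
print(f"complex per channel : {cplx_b / MIB:.1f} MiB")
print(f"data, all channels  : {data_bytes / MIB:.1f} MiB")
print(f"assumed worst total : {(data_bytes + scratch_bytes) / MIB:.1f} MiB")
```

Under these assumptions the three 256^3 channels need 385.5 MiB of buffers and about 579 MiB in the worst case. This is only a paper estimate; the before/after cuMemGetInfo() measurement remains the reliable check.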

apaehler · January 20, 2008, 4:51pm

I am not sure whether this is really a memory problem. I wrote Python bindings for CUDA and CUFFT and tested them against FFTW 2.1.5 under Linux. Here are my results for two cases:

1. nx = 256, ny = 128, nz = 256
2. nx = 256, ny = 256, nz = 128

Note that the space requirements for both are the same, yet 1. gives correct results, while 2. gives erroneous results (the GPU is an 8600GTS with 256 MB):

case 1:

+------------------------+
| Fast Fourier Transform |
| using CUDA driver API  |
+------------------------+
NX = 256 NY = 128 NZ = 256 --- doComplex: False
Megabytes needed: 64
206 MB free out of 255 MB
Processing time: 0.213 sec
Gigaflops GPU : 9.07 (256 128 256)
Error CPU initial vs GPU
Avg and max rel error = 2.47e-07 1.39e-06
Processing time: 0.778 sec
Gigaflops CPU : 2.48
Speedup GPU/CPU: 3.66
Error CPU final vs CPU initial
Avg and max rel error = 6.33e-08 4.77e-07

case 2:

+------------------------+
| Fast Fourier Transform |
| using CUDA driver API  |
+------------------------+
NX = 256 NY = 256 NZ = 128 --- doComplex: False
Megabytes needed: 64
205 MB free out of 255 MB
Processing time: 0.231 sec
Gigaflops GPU : 8.35 (256 256 128)
Error CPU initial vs GPU
Avg and max rel error = 6.35e-03 5.10e+01
Processing time: 0.818 sec
Gigaflops CPU : 2.36
Speedup GPU/CPU: 3.54
Error CPU final vs CPU initial
Avg and max rel error = 6.41e-08 4.81e-07
Error CPU final vs GPU
Avg and max rel error = 6.35e-03 5.10e+01

Error CPU final vs GPU
Avg and max rel error = 2.25e-07 1.25e-06

apaehler · January 20, 2008, 5:04pm

Error CPU final vs GPU
Avg and max rel error = 6.35e-03 5.10e+01

Cut-and-paste error in my previous post: the following lines belong to case 1, not case 2:

Error CPU final vs GPU
Avg and max rel error = 2.25e-07 1.25e-06

apaehler · January 21, 2008, 7:12pm

I posted my reply to Linh Ha before reading your post. The comment about the heuristics is somewhat unsatisfactory. I would like to use CUFFT in production code, where I calculate nx, ny, nz based on other data, and having things fail unpredictably is not an option.

Does this make any sense to the authors of CUFFT?

NX   NY   NZ   C2C   Avg rel error   Max rel error
256  256  32   no    2.06e-02        3.92e+01
256  256  32   yes   2.10e-07        1.19e-06
256  256  31   no    9.18e-07        8.70e-06
256  256  33   no    3.20e-07        1.98e-06

These results do not make any sense. The error is |V - inverse_fft(fft(V))|, with the inverse FFT result scaled by 1/(nx*ny*nz). C2C (yes/no) means complex-complex-complex versus real-complex-real.

Why should nz=32 fail for r-c-r transforms but be OK for c-c-c transforms? And why should r-c-r then not fail for oddballs like nz=31 or nz=33?

The amount of memory needed for nz=32 r-c-r is 16 MB; even with up to 3x workspace there is plenty of room left on my lowly 256 MB card, and cuMemGetInfo confirms that.

Are you sure that there are no bugs in the CUFFT code?
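
The round-trip error used above can be illustrated without a GPU. This is a minimal sketch under stated assumptions, not the code behind the table: it uses a naive 1D DFT from the Python standard library in place of CUFFT or FFTW, leaves both directions unnormalized (as CUFFT and FFTW do) so that the 1/N scaling is needed, and assumes "rel error" means the absolute difference divided by the largest input magnitude. The function names are hypothetical.

```python
import cmath
import random

def dft(x, sign):
    """Naive unnormalized DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

def round_trip_errors(v):
    """Return (avg, max) relative error of v against scaled inverse(forward(v))."""
    n = len(v)
    back = dft(dft(v, -1), +1)           # equals n * v up to rounding
    scaled = [b.real / n for b in back]  # undo the missing normalization
    vmax = max(abs(a) for a in v)        # assumed normalization of the error
    rel = [abs(a - b) / vmax for a, b in zip(v, scaled)]
    return sum(rel) / n, max(rel)

random.seed(0)
v = [random.uniform(-1.0, 1.0) for _ in range(32)]
avg_err, max_err = round_trip_errors(v)
print(f"Avg and max rel error = {avg_err:.2e} {max_err:.2e}")
```

A correct transform pair should give errors near the rounding level of the arithmetic in use (around 1e-7 in single precision, as in the passing rows above), so a maximum error of order 10 points to corrupted data rather than to accumulated rounding.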

apaehler · January 21, 2008, 8:37pm

Answering my previous reply to your post: the problem is with synchronization:

old code:

cuCtxSynchronize()
if doComplex:
    cufftExecC2C(plan,d_A,d_B,CUFFT_FORWARD)
    cufftExecC2C(plan,d_B,d_A,CUFFT_INVERSE)
else:
    cufftExecR2C(plan1,d_A,d_B)
    cufftExecC2R(plan2,d_B,d_A)
cuCtxSynchronize()

new code:

cuCtxSynchronize()
if doComplex:
    cufftExecC2C(plan,d_A,d_B,CUFFT_FORWARD)
    cuCtxSynchronize()
    cufftExecC2C(plan,d_B,d_A,CUFFT_INVERSE)
else:
    cufftExecR2C(plan1,d_A,d_B)
    cuCtxSynchronize()
    cufftExecC2R(plan2,d_B,d_A)
cuCtxSynchronize()

Once cuCtxSynchronize() is added after a cufftExec* call, everything works fine. For example, (256,256,15) previously gave an execute error; it no longer does. If this really matters as much as it seems, maybe it should be mentioned in the documentation.
