size limit of 1D FFT

Dear all: I want to do 3-dimensional sine FFT via cuFFT,

the procedure is

  1. compute 1-D FFT for dimension z with batch = n1*n2

2 transpose from (x,y,z) to (y,z,x)

  1. compute 1-D FFT for dimension x with batch = n2*n3

  2. transpose from (y,z,x) to (z,x,y)

  3. compute 1-D FFT for dimension y with batch = n1*n3

  4. transpose from (z,x,y) to (x,y,z)

everything is O.K. but I suffer problem on (nx,ny,nz) = (512,512,512)

the problem comes from 1-D FFT, I write simple code to demonstrate this

the following code is doing 1-D FFT R2C with size n * batch

trnasform real d_u (of size batch * n) to complex d_u_hat ( of size batch*(n/2+1) )

[codebox]#include <stdio.h>

#include <assert.h>

#include “global.h”

void randomInit(doublereal* data, unsigned long long int size)

{

for (int i = 0; i < size; ++i){ data[i] = (double)rand() / (double)RAND_MAX; }

}

void test_1D_size_limit( void )

{

#ifdef DO_DOUBLE

int batch = 512 * 512  ; 

int n = 256*2 ;

#else

int batch = 512 * 512  ; 

int n = 511*2 ;

#endif

cufftResult flag ;

cufftHandle  plan ; 

doublereal *d_u ; // device memory

Complex *d_u_hat ; // device memory

doublereal *u ; // host memory

// step 1: random data

u = (doublereal *)malloc( sizeof(doublereal)*batch*n ) ;

assert( u ) ;

randomInit( u,  batch*n ) ; // random data 

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*batch*n) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*batch*n , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*batch*((n>>1) + 1) ) );

#if defined (DO_DOUBLE)

flag = cufftPlan1d(&plan, n, CUFFT_D2Z, batch );

if ( CUFFT_SUCCESS != flag ){printf("Error: cufftPlan1d( CUFFT_D2Z ) fails\n"); }

#else

flag = cufftPlan1d(&plan, n, CUFFT_R2C, batch );

if ( CUFFT_SUCCESS != flag ){ printf("Error: cufftPlan1d( CUFFT_R2C ) fails\n"); }

#endif

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan,       (cufftReal *)d_u, d_u_hat );

#endif

if ( CUFFT_SUCCESS != flag ){

	printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

	printf("error code (cufft) = %d\n", flag);

}

}

[/codebox]

the content of file “global.h” is

[codebox]// “global.h”

#include <cufft.h>

#include <cutil_inline.h>

//#define DO_DOUBLE

#ifdef DO_DOUBLE

typedef double  doublereal ;

typedef cufftDoubleComplex Complex; 

#else

typedef float   doublereal ;

typedef cufftComplex  Complex; 

#endif[/codebox]

  1. batch = 512 * 512, n = 511*2

    the program is O.K.

  2. batch = 512 * 512, n = 512*2

output is

Error (cufftExecR2C): memory size or pointer value too large to fit in 32 bit

error code (cufft) = 6

after searching cufft.h, error code = 6 means “CUFFT_EXEC_FAILED”

so this is not out-of-memory problem,

does this mean that maximum size of 1-D FFT R2C is 512512511*2, which leads to

2GB ( du + du_hat )

ps: my platform is winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda 2.3

sorry, I must correct my testing code,

I use cudaGetErrorString (cudaGetLastError()) to find out error message,

[codebox]// step 1: random generate data

u = (doublereal *)malloc( sizeof(doublereal)*batch*n ) ;

assert( u ) ;

randomInit( u,  batch*n ) ;

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*batch*n) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*batch*n , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*batch*((n>>1) + 1) ) );

#if defined (DO_DOUBLE)

cufftPlan1d(&plan, n, CUFFT_D2Z, batch );

#else

cufftPlan1d(&plan, n, CUFFT_R2C, batch );

#endif

cudaThreadSynchronize();

printf("Error( cufftPlan1d ): %s \n",cudaGetErrorString (cudaGetLastError()));

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan,       (cufftReal *)d_u, d_u_hat );

#endif

cudaThreadSynchronize();

printf("Error(cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){

	printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

	printf("error code (cufft) = %d\n", flag);

}

[/codebox]

if batch = 256 * 256, n = 128 ,then error occurs

output is

Error( cufftPlan1d ): no error

Error(cufftExecR2C): invalid configuration argument

(1) if I try (batch,n) = (256*255, 128), then no error.

This may imply that “1D transform sizes up to 8 million elements”

however if I try (batch,n) = (1, 256* 256 * 128), then no error.

I am confused.

(2) when error “invalid configuration argument” occurs,

why cufftExecR2C return “CUFFT_SUCCESS” such that code segment

if ( CUFFT_SUCCESS != flag ){

		printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

		printf("error code (cufft) = %d\n", flag);

	}

	

does not show any error?

Can anyone repeat my error?

dear all: I write another code to test size limit of 1-D FFT

Given (n1, n2, n3)

  1. compute FFT along z-axis with size 2*(n3+1), batch = n1*n2

  2. compute FFT along x-axis with size 2*(n1+1), batch = n2*n3

  3. compute FFT along y-axis with size 2*(n2+1), batch = n3*n1

after each cufftExecR2C call, we use cudaGetErrorString (cudaGetLastError())

to report its error message.

source is

[codebox]#ifdef DO_DOUBLE

int n1 = 512 ;

int n2 = 512 ;

int n3 = 383 ;

#else

int n1 = 511 ;

int n2 = 511 ;

int n3 = 768 ;

#endif

cufftResult  flag ;

cufftHandle  plan_z, plan_x, plan_y ; 

doublereal *d_u ; // device memory

Complex *d_u_hat ; // device memory

doublereal *u ; // host memory

int  NN = n1*n2 ;

NN = qMax( NN, n2*n3 ) ;

NN = qMax( NN, n3*n1 ) ; // NN = max( n1, n2, n3)

size_t numOfElement = n1*n2*n3 + 2*NN ;

// step 1: random generate data

u = (doublereal *)malloc( sizeof(doublereal)*numOfElement ) ;

assert( u ) ;

randomInit( u,  numOfElement ) ;

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*numOfElement) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*numOfElement , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*numOfElement) );	

#if defined (DO_DOUBLE)

cufftPlan1d(&plan_z, 2*(n3+1), CUFFT_D2Z, n1*n2 );

cufftPlan1d(&plan_x, 2*(n1+1), CUFFT_D2Z, n2*n3 );

cufftPlan1d(&plan_y, 2*(n2+1), CUFFT_D2Z, n3*n1 );

#else

cufftPlan1d(&plan_z, 2*(n3+1), CUFFT_R2C, n1*n2 );

cufftPlan1d(&plan_x, 2*(n1+1), CUFFT_R2C, n2*n3 );

cufftPlan1d(&plan_y, 2*(n2+1), CUFFT_R2C, n3*n1 );

#endif

[/codebox]

code is so long that I divide it into two parts

[codebox]// z-FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_z, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan_z,       (cufftReal *)d_u, d_u_hat );

#endif

cudaThreadSynchronize();

printf("Error(cufft, z-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }	

// x-axis FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_x, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan_x,       (cufftReal *)d_u, d_u_hat );

#endif

cudaThreadSynchronize();

printf("Error(cufft, x-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }

// y-axis FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_y, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan_y,       (cufftReal *)d_u, d_u_hat );

#endif

cudaThreadSynchronize();

printf("Error(cufft, y-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }		[/codebox]

platform: winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda2.3

we consider “float” version and output is summaried as following table

(1) size: total device memory

(2) z-FFT: error message after z-FFT

(3) x-FFT: error message after x-FFT

(4) y-FFT: error message after y-FFT

[codebox]--------------±--------±-----------------------±-----------------------±-----------------------+

n1,n2,n3 | size | z-FFT | x-FFT | y-FFT |

--------------±--------±-----------------------±-----------------------±-----------------------+

255,255,255 | 255 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

511,255,255 | 511 MB | invalid configuration | no error | invalid configuration |

          |         | argument               |                        | argument               |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,256,256 | 516 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

511,511,255 | 1024 MB | invalid configuration | invalid configuration | out of memory |

          |         | argument               | argument               | CUFFT_EXEC_FAILED      |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,512,256 | 1032 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

511,511,511 | 2044 MB | out of memory | out of memory | out of memory |

          |         | CUFFT_EXEC_FAILED      | CUFFT_EXEC_FAILED      | CUFFT_EXEC_FAILED      |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,512,512 | 2056 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,511,767 | 3074 MB | no error | no error | memory size or pointer |

          |         |                        |                        | value too large to fit |

          |         |                        |                        | in 32 bit              |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,512,767 | 3080 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,512,768 | 3084 MB | no error | no error | no error |

--------------±--------±-----------------------±-----------------------±-----------------------+

512,512,1024 | 4112 MB | no error | CUFFT_INVALID_PLAN | CUFFT_INVALID_PLAN |

--------------±--------±-----------------------±-----------------------±-----------------------+[/codebox]

I have no consistent explanation on above table.

for example, (n1,n2,n3) = (511,511,511) shows “out of memory” but

(n1,n2,n3) = (512,512,512) works without error

Does anyone have ideas?

problem is still the same even installing latest cuda 2.3 posted in
http://forums.nvidia.com/index.php?showtop…rt=#entry577902

does anyone have the same experience?

Maybe the problem lies in the memory-bound transpose program. Have you solved it?

I am having the same exact problem. The image processing program that I am working with uses cufftPlan1d CUFFT_R2C followed by CUFFT_C2R to filter images.

If I set 1024 as my transform size (nx) and 1000 (height of the image) as the number of transforms (batch) everything works fine.

If I set 2048 as my transform size (nx) and 2000 (height of the image) as the number of transforms (batch) this is the error I get: CUFFT_EXEC_FAILED

What are the official limits of cufftPlan1d?

BTW my Tesla GPU has 3GB memory.

I am being told these sizes should work, so the problem is likely elsewhere. Are you using CUDA 4.0? Could you attach a small, self-contained, app that reproduces the problem?

The guy who wrote the program originaly was using half height of the image as a dimension of blockdim of cuda kernel operations. So 1024/2 = 512 and is withn limits. 2048/2 exceeds 512 which exceeds the maximum number of threads in a block: 512. See reference below. The next cuda operation to be called was the cufftExecR2C. For whatever reason the first cuda operation wasn’t failing but the cufftExecR2C which followed was. I figure the GPU was fritzed out by the first operation and the next operation to come along crashed.

The total size of a block is limited at 512 threads

From page 2/10 of this reference manual:

http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf

It looks like the apps is missing some error checking upstream of the CUFFT call. Since the error status isn’t checked and cleared, it then gets reported on a subsequent CUDA operation, which in this case happened to be inside the CUFFT call. As this can be very confusing(as the present case demonstrates), I recommend checking every API call and every kernel launch. It seems like in your case there is a previous kernel launch that is not checked. To catch both synchronously reported pre-launch errors and asynchronously kernle failures in non-production code I use the following handy macro:

// Macro to catch CUDA errors in kernel launches

#define CHECK_LAUNCH_ERROR()                                          \

do {                                                                  \

    /* Check synchronous errors, i.e. pre-launch */                   \

    cudaError_t err = cudaGetLastError();                             \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString(err) );       \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \

    err = cudaThreadSynchronize();                                    \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString( err) );      \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)