size limit of 1D FFT

LSChien · September 4, 2009, 3:40am

Dear all: I want to do 3-dimensional sine FFT via cuFFT,

the procedure is

compute 1-D FFT for dimension z with batch = n1*n2

2 transpose from (x,y,z) to (y,z,x)

compute 1-D FFT for dimension x with batch = n2*n3
transpose from (y,z,x) to (z,x,y)
compute 1-D FFT for dimension y with batch = n1*n3
transpose from (z,x,y) to (x,y,z)

everything is O.K. but I suffer problem on (nx,ny,nz) = (512,512,512)

the problem comes from 1-D FFT, I write simple code to demonstrate this

the following code is doing 1-D FFT R2C with size n * batch

trnasform real d_u (of size batch * n) to complex d_u_hat ( of size batch*(n/2+1) )

[codebox]#include <stdio.h>

#include <assert.h>

#include “global.h”

void randomInit(doublereal* data, unsigned long long int size)

{

for (int i = 0; i < size; ++i){ data[i] = (double)rand() / (double)RAND_MAX; }

}

void test_1D_size_limit( void )

{

#ifdef DO_DOUBLE

int batch = 512 * 512  ; 

int n = 256*2 ;

#else

int batch = 512 * 512  ; 

int n = 511*2 ;

#endif

cufftResult flag ;

cufftHandle  plan ; 

doublereal *d_u ; // device memory

Complex *d_u_hat ; // device memory

doublereal *u ; // host memory

// step 1: random data

u = (doublereal *)malloc( sizeof(doublereal)*batch*n ) ;

assert( u ) ;

randomInit( u,  batch*n ) ; // random data

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*batch*n) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*batch*n , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*batch*((n>>1) + 1) ) );

#if defined (DO_DOUBLE)

flag = cufftPlan1d(&plan, n, CUFFT_D2Z, batch );

if ( CUFFT_SUCCESS != flag ){printf("Error: cufftPlan1d( CUFFT_D2Z ) fails\n"); }

#else

flag = cufftPlan1d(&plan, n, CUFFT_R2C, batch );

if ( CUFFT_SUCCESS != flag ){ printf("Error: cufftPlan1d( CUFFT_R2C ) fails\n"); }

#endif

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u_hat );

#else

flag = cufftExecR2C( plan,       (cufftReal *)d_u, d_u_hat );

#endif

if ( CUFFT_SUCCESS != flag ){

	printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

	printf("error code (cufft) = %d\n", flag);

}

}

[/codebox]

the content of file “global.h” is

[codebox]// “global.h”

#include <cufft.h>

#include <cutil_inline.h>

//#define DO_DOUBLE

#ifdef DO_DOUBLE

typedef double  doublereal ;

typedef cufftDoubleComplex Complex;

#else

typedef float   doublereal ;

typedef cufftComplex  Complex;

#endif[/codebox]

batch = 512 * 512, n = 511*2

the program is O.K.
batch = 512 * 512, n = 512*2

output is

Error (cufftExecR2C): memory size or pointer value too large to fit in 32 bit

error code (cufft) = 6

after searching cufft.h, error code = 6 means “CUFFT_EXEC_FAILED”

so this is not out-of-memory problem,

does this mean that maximum size of 1-D FFT R2C is 512512511*2, which leads to

2GB ( du + du_hat )

ps: my platform is winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda 2.3

LSChien · September 4, 2009, 3:51pm

the content of file “global.h” is

[codebox]// “global.h”

include <cufft.h>

include <cutil_inline.h>

//define DO_DOUBLE

ifdef DO_DOUBLE
typedef double  doublereal ;

typedef cufftDoubleComplex Complex; 
else
typedef float   doublereal ;

typedef cufftComplex  Complex; 
endif[/codebox]

ps: my platform is winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda 2.3

sorry, I must correct my testing code,

I use cudaGetErrorString (cudaGetLastError()) to find out error message,

[codebox]// step 1: random generate data

u = (doublereal *)malloc( sizeof(doublereal)*batch*n ) ;

assert( u ) ;

randomInit( u,  batch*n ) ;

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*batch*n) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*batch*n , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*batch*((n>>1) + 1) ) );

#if defined (DO_DOUBLE)

cufftPlan1d(&plan, n, CUFFT_D2Z, batch );

else

cufftPlan1d(&plan, n, CUFFT_R2C, batch );

endif

cudaThreadSynchronize();

printf("Error( cufftPlan1d ): %s \n",cudaGetErrorString (cudaGetLastError()));

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan, (cufftDoubleReal *)d_u, d_u_hat );

else

flag = cufftExecR2C( plan,       (cufftReal *)d_u, d_u_hat );

endif

cudaThreadSynchronize();

printf("Error(cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){

	printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

	printf("error code (cufft) = %d\n", flag);

}

[/codebox]

if batch = 256 * 256, n = 128 ,then error occurs

output is

Error( cufftPlan1d ): no error

Error(cufftExecR2C): invalid configuration argument

(1) if I try (batch,n) = (256*255, 128), then no error.

This may imply that “1D transform sizes up to 8 million elements”

however if I try (batch,n) = (1, 256* 256 * 128), then no error.

I am confused.

(2) when error “invalid configuration argument” occurs,

why cufftExecR2C return “CUFFT_SUCCESS” such that code segment

if ( CUFFT_SUCCESS != flag ){

		printf("Error (cufftExecR2C): %s \n",cudaGetErrorString (cudaGetLastError()));

		printf("error code (cufft) = %d\n", flag);

	}

	

does not show any error?

Can anyone repeat my error?

LSChien · September 5, 2009, 8:05am

the content of file “global.h” is

[codebox]// “global.h”

include <cufft.h>

include <cutil_inline.h>

//define DO_DOUBLE

ifdef DO_DOUBLE
typedef double  doublereal ;

typedef cufftDoubleComplex Complex; 
else
typedef float   doublereal ;

typedef cufftComplex  Complex; 
endif[/codebox]

ps: my platform is winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda 2.3

dear all: I write another code to test size limit of 1-D FFT

Given (n1, n2, n3)

compute FFT along z-axis with size 2*(n3+1), batch = n1*n2
compute FFT along x-axis with size 2*(n1+1), batch = n2*n3
compute FFT along y-axis with size 2*(n2+1), batch = n3*n1

after each cufftExecR2C call, we use cudaGetErrorString (cudaGetLastError())

to report its error message.

source is

[codebox]ifdef DO_DOUBLE

int n1 = 512 ;

int n2 = 512 ;

int n3 = 383 ;

else

int n1 = 511 ;

int n2 = 511 ;

int n3 = 768 ;

endif

cufftResult  flag ;

cufftHandle  plan_z, plan_x, plan_y ; 

doublereal *d_u ; // device memory

Complex *d_u_hat ; // device memory

doublereal *u ; // host memory

int  NN = n1*n2 ;

NN = qMax( NN, n2*n3 ) ;

NN = qMax( NN, n3*n1 ) ; // NN = max( n1, n2, n3)

size_t numOfElement = n1*n2*n3 + 2*NN ;

// step 1: random generate data

u = (doublereal *)malloc( sizeof(doublereal)*numOfElement ) ;

assert( u ) ;

randomInit( u,  numOfElement ) ;

// step 2: out-of-place forward FFT in device

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(doublereal)*numOfElement) );

CUDA_SAFE_CALL(cudaMemcpy( d_u, u, sizeof(doublereal)*numOfElement , cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(Complex)*numOfElement) );

#if defined (DO_DOUBLE)

cufftPlan1d(&plan_z, 2*(n3+1), CUFFT_D2Z, n1*n2 );

cufftPlan1d(&plan_x, 2*(n1+1), CUFFT_D2Z, n2*n3 );

cufftPlan1d(&plan_y, 2*(n2+1), CUFFT_D2Z, n3*n1 );

else

cufftPlan1d(&plan_z, 2*(n3+1), CUFFT_R2C, n1*n2 );

cufftPlan1d(&plan_x, 2*(n1+1), CUFFT_R2C, n2*n3 );

cufftPlan1d(&plan_y, 2*(n2+1), CUFFT_R2C, n3*n1 );

endif

[/codebox]

code is so long that I divide it into two parts

[codebox]// z-FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_z, (cufftDoubleReal *)d_u, d_u_hat );

else

flag = cufftExecR2C( plan_z,       (cufftReal *)d_u, d_u_hat );

endif

cudaThreadSynchronize();

printf("Error(cufft, z-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }

// x-axis FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_x, (cufftDoubleReal *)d_u, d_u_hat );

else

flag = cufftExecR2C( plan_x,       (cufftReal *)d_u, d_u_hat );

endif

cudaThreadSynchronize();

printf("Error(cufft, x-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }

// y-axis FFT

#if defined (DO_DOUBLE)

flag = cufftExecD2Z( plan_y, (cufftDoubleReal *)d_u, d_u_hat );

else

flag = cufftExecR2C( plan_y,       (cufftReal *)d_u, d_u_hat );

endif

cudaThreadSynchronize();

printf("Error(cufft, y-axis): %s \n",cudaGetErrorString (cudaGetLastError()));

if ( CUFFT_SUCCESS != flag ){ printf("error code (cufft) = %d\n", flag); }		[/codebox]

platform: winxp pro64, vc2005, Tesla C1060, driver 190.38, cuda2.3

we consider “float” version and output is summaried as following table

(1) size: total device memory

(2) z-FFT: error message after z-FFT

(3) x-FFT: error message after x-FFT

(4) y-FFT: error message after y-FFT

[codebox]--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

          |         | argument               |                        | argument               |

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

          |         | argument               | argument               | CUFFT_EXEC_FAILED      |

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

          |         | CUFFT_EXEC_FAILED      | CUFFT_EXEC_FAILED      | CUFFT_EXEC_FAILED      |

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

          |         |                        |                        | value too large to fit |

          |         |                        |                        | in 32 bit              |

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+

--------------±--------±-----------------------±-----------------------±-----------------------+[/codebox]

I have no consistent explanation on above table.

for example, (n1,n2,n3) = (511,511,511) shows “out of memory” but

(n1,n2,n3) = (512,512,512) works without error

Does anyone have ideas?

LSChien · September 15, 2009, 8:48am

problem is still the same even installing latest cuda 2.3 posted in
[url=“http://forums.nvidia.com/index.php?showtopic=102548&pid=577902&mode=threaded&start=#entry577902”]http://forums.nvidia.com/index.php?showtop...rt=#entry577902[/url]

does anyone have the same experience?

seismic1 · January 5, 2011, 1:32am

Maybe the problem lies in the memory-bound transpose program. Have you solved it?

jtuszynski · September 21, 2011, 7:13pm

I am having the same exact problem. The image processing program that I am working with uses cufftPlan1d CUFFT_R2C followed by CUFFT_C2R to filter images.

If I set 1024 as my transform size (nx) and 1000 (height of the image) as the number of transforms (batch) everything works fine.

If I set 2048 as my transform size (nx) and 2000 (height of the image) as the number of transforms (batch) this is the error I get: CUFFT_EXEC_FAILED

What are the official limits of cufftPlan1d?

BTW my Tesla GPU has 3GB memory.

njuffa · September 21, 2011, 8:39pm

I am being told these sizes should work, so the problem is likely elsewhere. Are you using CUDA 4.0? Could you attach a small, self-contained, app that reproduces the problem?

jtuszynski · September 23, 2011, 10:39pm

The guy who wrote the program originaly was using half height of the image as a dimension of blockdim of cuda kernel operations. So 1024/2 = 512 and is withn limits. 2048/2 exceeds 512 which exceeds the maximum number of threads in a block: 512. See reference below. The next cuda operation to be called was the cufftExecR2C. For whatever reason the first cuda operation wasn’t failing but the cufftExecR2C which followed was. I figure the GPU was fritzed out by the first operation and the next operation to come along crashed.

The total size of a block is limited at 512 threads

From page 2/10 of this reference manual:

njuffa · September 24, 2011, 5:51am

It looks like the apps is missing some error checking upstream of the CUFFT call. Since the error status isn’t checked and cleared, it then gets reported on a subsequent CUDA operation, which in this case happened to be inside the CUFFT call. As this can be very confusing(as the present case demonstrates), I recommend checking every API call and every kernel launch. It seems like in your case there is a previous kernel launch that is not checked. To catch both synchronously reported pre-launch errors and asynchronously kernle failures in non-production code I use the following handy macro:

// Macro to catch CUDA errors in kernel launches

#define CHECK_LAUNCH_ERROR()                                          \

do {                                                                  \

    /* Check synchronous errors, i.e. pre-launch */                   \

    cudaError_t err = cudaGetLastError();                             \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString(err) );       \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \

    err = cudaThreadSynchronize();                                    \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString( err) );      \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)

Topic		Replies	Views
Questions about cuFFT for 3D matrix, arrayFire GPU-Accelerated Libraries	5	1649	October 12, 2021
allocation problem in cuFFT CUDA Programming and Performance	2	2555	September 16, 2009
CUFFT_INTERNAL_ERROR during creation of a 1D Plan in CUFFT GPU-Accelerated Libraries cuda , cufft	11	3727	October 19, 2022
Trouble with cuFFT on multiple GPUs GPU-Accelerated Libraries	13	3683	August 26, 2017
Inconsistent Cuda Errors CUDA Programming and Performance	10	1056	July 7, 2017
cufftPlan2d fails CUDA Programming and Performance	14	21012	September 17, 2007
cufftXt batch 1D GPU-Accelerated Libraries	12	2149	October 15, 2019
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	4998	October 29, 2019
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	241028	November 16, 2011
Number of kilobytes transferred to/from shared memory twice the expected CUDA Programming and Performance	12	701	September 29, 2018

size limit of 1D FFT

Related topics