allocation problem in cuFFT

Dear all:

I try to check maximum problem size of cuFFT on 3D data.

my platform: winxp pro64, vc2005, GTX295 + Tesla C1060, driver 109.38, cuda 2.3

I use Tesla C1060 as computational kernel and focus on “double precision”

first I can allocate 3.7GB to do FFT (real to complex) as code 1

nx = ny = nz = 628

code 1

[codebox]

cufftResult ret ;

size_t nx = 628 ;

size_t ny = 628 ;

size_t nz = 628 ;

size_t N  = nx * ny * nz ;

double    *d_u ;	 // device of source 

double    *d_u_hat ; // device of frequency component of source 

cufftHandle   plan ;  // plane of forward R2C transform

cutilSafeCall( cudaMalloc((void**)&d_u,     sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );

ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );	

if ( CUFFT_SUCCESS != ret ){

if ( CUFFT_ALLOC_FAILED == ret ){

    cout << "Error: Allocation of GPU resources for the plan failed" << endl ;

}else{

    cout << "Error: cufftPlan3d fails for other reason" << endl ;	

}

}else{

cout << "cufftPlan3d success" << endl ;

} // if (CUFFT_SUCCESS != ret )

[/codebox]

Second if I want to do the same FFT for 2 data set but small size

nx = ny = nz = 480, then error occurs at

cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

with error ID = CUFFT_ALLOC_FAILED (see code 2)

code 2

[codebox]

cufftResult ret ;

size_t nx = 240*2 ;

size_t ny = 240*2 ;

size_t nz = 240*2 ;

size_t N = nx * ny * nz ;

double *d_G_hat ; // device of frequency component of kernel

double *d_u ; // device of source

double *d_u_hat ; // device of frequency component of source

double *d_G ; // d_G <–> h_G

cufftHandle plan ; // plane of forward R2C transform

cutilSafeCall( cudaMalloc((void**)&d_G_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_G, sizeof(double)*N) );

ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

if ( CUFFT_SUCCESS != ret ){

if ( CUFFT_ALLOC_FAILED == ret ){

   cout << "Error: Allocation of GPU resources for the plan failed" << endl ;

}else{

   cout << "Error: cufftPlan3d fails for other reason" << endl ;	

}

}else{

cout << “cufftPlan3d success” << endl ;

} // if (CUFFT_SUCCESS != ret)

[/codebox]

moreover if I interchange

cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

and

cutilSafeCall( cudaMalloc((void**)&d_G, sizeof(double)*N) );

then error occurs at

	cutilSafeCall( cudaMalloc((void**)&d_G, sizeof(double)*N) );

it says “out of memory” (see code 3)

code 3

[codebox]

cufftResult ret ;

size_t nx = 240*2 ;

size_t ny = 240*2 ;

size_t nz = 240*2 ;

size_t N = nx * ny * nz ;

double *d_G_hat ; // device of frequency component of kernel

double *d_u ; // device of source

double *d_u_hat ; // device of frequency component of source

double *d_G ; // d_G <–> h_G

cufftHandle plan ; // plane of forward R2C transform

cutilSafeCall( cudaMalloc((void**)&d_G_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );

ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

if ( CUFFT_SUCCESS != ret ){

if ( CUFFT_ALLOC_FAILED == ret ){

   cout << "Error: Allocation of GPU resources for the plan failed" << endl ;

}else{

   cout << "Error: cufftPlan3d fails for other reason" << endl ;	

}

}else{

cout << “cufftPlan3d success” << endl ;

} // if ( CUFFT_SUCCESS != ret )

cutilSafeCall( cudaMalloc((void**)&d_G, sizeof(double)*N) );

[/codebox]

However above code only uses 3.3GB, smaller than 3.7GB in code 1.

I cannot understand why “out of memory” occurs.

Dear all:

I test code 2 with “single precision”, then

I can allocate 3.78GB (nx = ny = nz = 314*2 ), this is large than that in

“double precision” which only has 3.29GB ( nx = ny = nz = 238*2 )

Does anyone know why?

dear all:

I use runtme API cuMemGetInfo() to extract memory usage and have some explanation for my question

Generally speaking, I add cuMemGetInfo() to sample memory usage before cufftPlan3d() and after cufftPlan3d(),

then watch how large device memory does cufftPlan3d() need?

I think that cufftPlan3d() needs extra global memory as working space, the reason comes from CUFFT_Library_2.3.pdf

page 13, Accuracy and Performance

The CUFFT library implements several FFT algorithms, each having

different performance and accuracy. The best performance paths

correspond to transform sizes that meet two criteria:

  1. Fit in CUDA?s shared memory

  2. Are powers of a single factor (for example, powers of two)

For transform sizes that do not meet either criteria above, CUFFT uses

an out-of-place, mixed-radix algorithm that stores all intermediate

results in CUDA’s global GPU memory.

       ^^^^^^^^^^^^^^^^^^^^^^^^^

first I modify code 3 as

[codebox] size_t nx = 2402 ; size_t ny = 2402 ; size_t nz = 240*2 ;

size_t N  = nx * ny * nz ;

double    *d_G_hat ; // device of frequency component of kernel

double    *d_u ;	 // device of source 

double    *d_u_hat ; // device of frequency component of source 

double    *d_G ;     // d_G <--> h_G  

cufftHandle   plan ;  // plane of forward R2C transform

cutilSafeCall( cudaMalloc((void**)&d_G_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u,     sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );

unsigned int free_mem,total_mem, used_mem;

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("before plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0, ((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 ); 

cufftResult ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("after plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0, ((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 );	

cutilSafeCall( cudaMalloc((void**)&d_G,     sizeof(double)*N) );	[/codebox]

The error message is

[codebox]before plan3d:total mem: 4095.750 MB, free: 1518.188 MB, used : 2577.563 MB

after plan3d:total mem: 4095.750 MB, free: 672.672 MB, used : 3423.078 MB

cudaSafeCall() Runtime API error in file <method3.cpp>, line 64 : out of memory.[/codebox]

this means taht cufftPlan3d() asks 343.078 - 2577.563 = 845MB global memory, then

next statement “cudaMalloc((void**)&d_G, sizeof(double)*N)” fails.

here nx = ny = nz = 240*2 = 2^5 x 3 x 5 is not power of single factor, so cufftPlan3d()

needs extrac memory to do out-of-place work.

second, I modify code 2 again

[codebox] size_t nx = 2402 ; size_t ny = 2402 ; size_t nz = 240*2 ;

size_t N  = nx * ny * nz ;

double    *d_G_hat ; // device of frequency component of kernel

double    *d_u ;	 // device of source 

double    *d_u_hat ; // device of frequency component of source 

double	  *d_G ;     // d_G <--> h_G  

cufftHandle   plan ;  // plane of forward R2C transform

cutilSafeCall( cudaMalloc((void**)&d_G_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u,     sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );

cutilSafeCall( cudaMalloc((void**)&d_G,     sizeof(double)*N) );	

unsigned int free_mem,total_mem, used_mem;

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("before plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0,((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 ); 	

cufftResult ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("after plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0, ((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 ); 

if ( CUFFT_ALLOC_FAILED == ret ){       

		cout << "Error: Allocation of GPU resources for the plan failed" << endl ; 

}[/codebox]

the error message is

[codebox]before plan3d:total mem: 4095.750 MB, free: 674.438 MB, used : 3421.313 MB

after plan3d:total mem: 4095.750 MB, free: 674.438 MB, used : 3421.313 MB

Error: Allocation of GPU resources for the plan failed[/codebox]

It seems that cufftPlan3d() does not ask any extra memory, this is wrong since

from code 3, we know cufftPlan3d() needs 845MB extra memory. however cufftPlan3d()

cannot allocate such working space since 3421.313 + 845 ~ 4.167GB (out-of-memory),

hence it returns “CUFFT_ALLOC_FAILED”

finally we modify code 1

[codebox] size_t nx = 628 ; size_t ny = 628 ; size_t nz = 628 ;

size_t N  = nx * ny * nz ;	

double    *d_u ;	 // device of source 	

double    *d_u_hat ; // device of frequency component of source 	

cufftHandle   plan ;  // plane of forward R2C transform	

cutilSafeCall( cudaMalloc((void**)&d_u,     sizeof(double)*N) );	

cutilSafeCall( cudaMalloc((void**)&d_u_hat, sizeof(double)*N) );	

unsigned int free_mem,total_mem, used_mem;

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("before plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0, ((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 ); 

cufftResult ret = cufftPlan3d(&plan, nx, ny, nz, CUFFT_D2Z );	

cuMemGetInfo( &free_mem, &total_mem );

used_mem = total_mem-free_mem;

printf("after plan3d:total mem: %0.3f MB, free: %0.3f MB, used : %0.3f MB\n",

	((double)total_mem)/1024.0/1024.0, ((double)free_mem )/1024.0/1024.0, 

	((double)used_mem )/1024.0/1024.0 ); [/codebox]

output message is

[codebox]before plan3d:total mem: 4095.750 MB, free: 270.215 MB, used : 3825.535 MB

after plan3d:total mem: 4095.750 MB, free: 222.043 MB, used : 3873.707 MB[/codebox]

cufftPlan3d() only asks 3873.707 - 3825.535 = 48.2 MB, and total memory is 3873.7MB < 4GB,

hence it works without any error.

Question: I can understand that cufftPlan3d() needs extra memory for data size is not power of

single factor, however I cannot understand why

  1. n = 480 = 2^5 x 3 x 5, cufftPlan3d() needs 845MB

  2. n = 628 = 4 x 157, cufftPlan3d() only needs 48.2MB

ps: if I use nx=ny=nz= 238*2, then cufftPlan3d() needs 194MB extra memory, it seems that

size of working space has large variation.