Why does cufftPlan need so much GPU memory?

Dear all,

   I'm using cuFFT to do some 3D FFT work, and I need batch = n^2. I get '2: out of memory' or '6: launch timed out' errors from cudaMemcpy() or cufftExecZ2D().

   I found that it is cufftPlan1d() that consumes most of my GPU memory. I tested the following code and was confused by the result.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>

cufftHandle	plan_1;

int main(int argc, char *argv[])
{
	int	i;
	int	ret = 0;
	size_t	mt  = 0;	/* total GPU memory                  */
	size_t	mf  = 0;	/* free GPU memory before the plan   */
	size_t	ml  = 0;	/* free GPU memory after the plan    */

	for( i = 0; i < 1025; i++ ) {
		cudaMemGetInfo( &mf, &mt );

		/* cufftPlan1d() returns a cufftResult, so compare against CUFFT_SUCCESS */
		if( CUFFT_SUCCESS != (ret = cufftPlan1d( &plan_1, i, CUFFT_D2Z, i*i )) ){
			//printf( "plan %d : failed.\n", i );
			continue;
		}

		cudaMemGetInfo( &ml, &mt );
		//printf( "plan %d : use %f MB GPU mem.\n", i, (float)(mf-ml)/(1024*1024) );

		if( ml == mf ){
			/* the plan did not consume any extra GPU memory */
			//printf("\n--->plan %d may be a good plan.\n\n", i);
			printf("%d\t", i);
		}

		cufftDestroy( plan_1 );
	}

	return 0;
}

I got the following output:

1 2 11 13 17 19 22 23 26 29 31 33 34 37 38 39 41 43 44 46 47 51 52 55 57 58 62 65 66 68 69 74 76 77 78 82 85 86 87 88 91 92 93 94 95 99 102 104 110 111 114 115 116 117 119 121 123 124 129 130 132 133 136 138 141 143 145 148 152 153 154 155 156 161 164 165 169 170 171 172 174 176 182 184 185 186 187 188 190 195 198 203 204 205 207 208 209 215 217 220 221 222 228 230 231 232 233 234 235 236 237 238 239 241 242 243 244 245 246 247 248 249 250 251 253 254 255 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 281 282 283 284 285 286 287 289 290 291 292 293 294 295 296 297 298 299 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 343 512 625 729

When n is some 'undesirable' number, such as 127, 256, 175, 202, or 227, the plan consumes hundreds of MB of GPU memory, which in turn contributes to the failure of my app.
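For reference, later cuFFT releases (5.5 and newer, so not the toolkit from this post) provide cufftEstimate1d() / cufftGetSize1d(), which report the work-area size a plan would require without actually building it. A minimal sketch along those lines, assuming one of those newer toolkits, compiled with nvcc and linked against -lcufft:

#include <stdio.h>
#include <cufft.h>

/* Probe the work-area requirement of a batched 1D D2Z plan for each size n,
 * using cufftEstimate1d(); it only reports the scratch size and does not
 * allocate anything, so "expensive" sizes can be spotted cheaply. */
int main(void)
{
	int n;

	for( n = 1; n < 1025; n++ ) {
		size_t workSize = 0;

		if( CUFFT_SUCCESS != cufftEstimate1d( n, CUFFT_D2Z, n*n, &workSize ) )
			continue;

		if( workSize > 0 )
			printf( "n = %4d : ~%.1f MB of work area\n",
			        n, (double)workSize / (1024.0*1024.0) );
	}

	return 0;
}

Sizes that need little or no work area should roughly match the 'good plan' list above, though the exact behaviour depends on the cuFFT version.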

Does this problem exist on your machine? Can anyone explain it for me?

My config: GeForce GTX 260, the latest CUDA, Fedora 10 Linux.

To me, it seems clear that cuFFT allocates some persistent temporary buffers as part of the plan. These are needed for out-of-place operation. Given the ~1 ms it takes (in my experience) to allocate large GPU buffers, this definitely makes sense. Intel IPP's FFT (see ippiman.pdf) makes it clear that it tries to use persistent temporary buffers: