CUFFT memory usage

I’m trying to do a large batch of 1D power-of-two, in-place, complex-to-complex transforms and I’m interested in the memory usage. I found this thread on the subject:

http://forums.nvidia.com/index.php?showtopic=51036

"The heuristics in CUFFT are somewhat complicated, so it’s hard to predict how much temporary storage the library will use.

There are cases where it uses none, and there are cases when it can use up to 3x the size of the transform. It depends on the transform size and the particular FFT algorithm needed for that size (and that maps best to the HW). Even an in-place FFT might use some temporary storage depending on the signal size."

This is an interesting but not very helpful response. It does state that in some cases no temporary storage is actually used, and implies that these are normally in-place transforms. However, the memory is allocated when the plan is created, and at that point CUFFT doesn’t know whether the transforms will be in-place or not. At the moment I’m finding that CUFFT allocates a large amount of memory that I can’t really spare, and I’d like to know whether it really needs it.

My memory allocation strategy is made all the more complicated by the fact that the size of my batch of transforms actually depends on how much memory is left over for storing the rest of my data.
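
To make the sizing problem concrete, here is a rough sketch of the arithmetic I would like to be able to do. It is only an illustration: `max_batch` is a hypothetical helper of my own, not a CUFFT call. It assumes single-precision complex data (8 bytes per point) and reads the "up to 3x" figure quoted above as three times the size of the data being transformed, which may not be what the library actually does.

```cpp
#include <cstddef>

// Hypothetical helper (not part of CUFFT): the largest batch of n-point,
// in-place, single-precision complex transforms that fits in free_bytes.
//
// Assumption: on top of the data buffer itself, the library may allocate
// up to temp_factor times the data size as temporary storage
// (temp_factor = 3 for the quoted worst case, 0 for the best case).
std::size_t max_batch(std::size_t free_bytes, std::size_t n, std::size_t temp_factor) {
    if (n == 0) {
        return 0;  // nothing to transform, avoid dividing by zero
    }
    // One complex sample is two floats (real and imaginary parts).
    const std::size_t bytes_per_transform = n * 2 * sizeof(float);
    // In-place data (1x) plus the assumed temporary storage (temp_factor x).
    const std::size_t bytes_per_item = bytes_per_transform * (1 + temp_factor);
    return free_bytes / bytes_per_item;
}
```

With a 256 MiB budget and 1024-point transforms, `max_batch(268435456, 1024, 3)` gives 8192, while `max_batch(268435456, 1024, 0)` gives 32768. That factor of four between the worst and best case is why I'd like to know what CUFFT really needs.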

Edit: I also note that the source code posted by NVIDIA as CUFFT 1.1 obviously isn’t the real CUFFT 1.1 code. I can see from my cuda_profile.log file that the kernel names should actually end in “sp” (single-pass), “mpsm” (multi-pass shared mem) and “mpgm” (multi-pass global mem).