CUFFT issues

Hey, I am a babe in the woods trying to do a batched 1D FFT with CUFFT.

NX1=256 (length of each transform)
NY=4097 (number of transforms in the batch)

First I do this:
cufftSafeCall(cufftPlan1d(&plan, NX1, CUFFT_C2C, NY));
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_charge, (cufftComplex *)d_charge, CUFFT_FORWARD));

This works and yields the expected result (I am parallelizing an existing program).

However, later on I do this:
cufftSafeCall(cufftPlan1d(&plan3, NX1, CUFFT_C2C, NY));
cufftSafeCall(cufftExecC2C(plan3, (cufftComplex *)d_pot, (cufftComplex *)d_pot, CUFFT_INVERSE));

The output of this inverse transform is wrong, even after normalization. We’re talking mostly what looks like random garbage, interspersed with maybe a few values that look close to what they should be.
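For reference on the normalization step: CUFFT transforms are unnormalized, so a forward transform followed by an inverse one gives back the input multiplied by the transform length. Below is a minimal host-side sketch of that scaling in plain C, assuming the divisor is the length of one transform (NX1 here) rather than NX1 times the batch count. The `complex_f` type and the `normalize_inverse` helper are hypothetical stand-ins, not part of CUFFT, so the snippet compiles without the CUDA headers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's float2 / cufftComplex (assumption: same x/y layout). */
typedef struct { float x, y; } complex_f;

/* Scale every element by 1/nx, where nx is the length of ONE transform.
 * The batch count only determines how many elements get scaled. */
static void normalize_inverse(complex_f *data, size_t nx, size_t batch)
{
    assert(data != NULL && nx > 0);
    const float scale = 1.0f / (float)nx;
    for (size_t i = 0; i < nx * batch; ++i) {
        data[i].x *= scale;
        data[i].y *= scale;
    }
}
```

With the sizes in this post, that would be a call like `normalize_inverse(h_pot, NX1, NY)` on a host copy of the data (`h_pot` is a hypothetical name); the same scaling could equally be done in a small kernel on the device.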

However, if I do this:
cufftSafeCall(cufftPlan1d(&plan3, NX1, CUFFT_C2C, 1));
cufftSafeCall(cufftExecC2C(plan3, (cufftComplex *)d_pot, (cufftComplex *)d_pot, CUFFT_INVERSE));

The first batch looks pretty much like it should. In fact, it seems to work okay for batch sizes up to around 250 or so. So I thought that I’d do a for loop of 17 transforms of batch size 241 to get my 4097 (17 x 241 = 4097), walking the pointer forward each time by the appropriate amount. Still the exact same garbage as earlier. In fact, the garbage is pretty consistent.

Any ideas on what is going wrong? I read some stuff in the manual about padding, but to be honest, I haven’t a clue what it’s talking about. I tried doing an out-of-place transform, but that yielded the same garbage. Is it possible I’m running out of memory? Does the inverse transform just not work on data sets this large?

I am parallelizing a PIC (particle-in-cell) space-plasma simulation for an undergrad design project, in case anyone’s interested.
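On the chunked loop and the memory question: the pointer has to advance in units of complex elements (batches per chunk times transform length), and the raw data size can be worked out directly. Here is a minimal sketch of that arithmetic in plain C, assuming 8-byte single-precision complex elements. The helpers `chunk_offset` and `buffer_bytes` are hypothetical names for illustration, and the size computed covers only the data buffer, not any work area the plan allocates.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset (not byte offset) of the first transform in a chunk,
 * assuming transforms are stored back to back with nx elements each. */
static size_t chunk_offset(size_t chunk, size_t batches_per_chunk, size_t nx)
{
    assert(nx > 0 && batches_per_chunk > 0);
    return chunk * batches_per_chunk * nx;
}

/* Raw size in bytes of one buffer holding `batch` transforms of length nx. */
static size_t buffer_bytes(size_t nx, size_t batch, size_t bytes_per_element)
{
    assert(nx > 0 && batch > 0 && bytes_per_element > 0);
    return nx * batch * bytes_per_element;
}
```

With the numbers above, chunk c would start at `d_pot + chunk_offset(c, 241, NX1)`, provided `d_pot` is typed as `cufftComplex *` or `float2 *` so the offset counts elements. The buffer itself comes to `buffer_bytes(256, 4097, 8)` = 8,390,656 bytes, roughly 8 MB, which is modest on its own.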

Oh, and the data I am transforming is stored as float2, just like in the SDK example.