CUFFT performance issue?

Hello there,
I have multiple datasets of time series.
Each time series is 256 points long.

SETS:
A = 220k time series
B = set C + set D = 79k time series
C = 29k time series
D = 50k time series
F = 2 * set B

OK, so here is where it gets weird.
I take the FFT of all these sets (A', B', …)
and everything appears to be normal,
but then when I take the iFFT it takes very different times:
A=75ms
B=1300ms
C=20ms
D=25ms
F=1300ms
(numbers are ~averages over a few runs each)

Does anyone have any tips or pointers to figure out what this may be?

thanks =)

Actually, I just realized I get similar numbers on the FFT in the beginning…
I'm just checking whether some data is corrupt, but it passed all tests for NaNs and NULLs.
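
For reference, a host-side check of this kind can be sketched in C++ roughly as follows. This is only a sketch under assumptions: it assumes each set is stored as one flat array of floats with the series back to back, and `ScanResult` and `scan_series` are made-up names for illustration, not part of CUDA or CUFFT.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Result of scanning one data set (hypothetical helper type).
struct ScanResult {
    std::size_t nan_count = 0;   // number of NaN values
    std::size_t inf_count = 0;   // number of +/- infinity values
    long first_bad_series = -1;  // index of the first series with a bad value, -1 if none
};

// Scan a flat array holding time series back to back, `points` values each
// (256 in this thread). Assumes data.size() is a multiple of `points`.
ScanResult scan_series(const std::vector<float>& data, std::size_t points = 256) {
    ScanResult r;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const bool is_nan = std::isnan(data[i]);
        const bool is_inf = std::isinf(data[i]);
        if (is_nan) ++r.nan_count;
        if (is_inf) ++r.inf_count;
        if ((is_nan || is_inf) && r.first_bad_series < 0) {
            r.first_bad_series = static_cast<long>(i / points);
        }
    }
    return r;
}
```

Running it on each set separately (A, C, D) and then on B shows whether anything non-finite appears only in the combined set, and in which series.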

=/

I just parsed through all my input data again and double-checked that everything is in order…
but I still get 75-80 ms on set A and 900-1300 ms on set B…

Does this make sense to anyone?

sin() and cos() on the GPU have a fast path and a slow path depending on numeric range.

The slow path uses local memory. Ooops.

Compile everything with --use_fast_math to get __sinf() and __cosf(), which are implemented
in hardware but have lower precision.

Not sure how to force CUFFT to use fast math only (I don’t know enough about this library)

When in doubt, profile your program on data set B.
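
A crude way to do that without any tools is to time each set by hand. Below is a minimal C++ sketch, assuming the transform can be wrapped in a callable; `time_runs` and `work` are hypothetical names, not part of CUFFT. GPU work is launched asynchronously, so the callable would have to wait for the device to finish (for example with `cudaDeviceSynchronize()`) before returning, otherwise only the launch overhead gets measured.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Call `work` once as an untimed warm-up, then `runs` more times, and
// return the wall-clock duration of each timed run in milliseconds.
// `work` is a placeholder: for a GPU transform it would launch the FFT
// and then block until the device is done.
std::vector<double> time_runs(const std::function<void()>& work, int runs) {
    using clock = std::chrono::steady_clock;
    work();  // warm-up: the first call often includes one-time setup cost
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs > 0 ? runs : 0));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

Looking at the individual run times rather than just the average shows whether every run on set B is slow or only some of them. Timing C and D on their own and then the two halves of B is one way to bisect where the extra time comes from.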

Hmm, I tried -use_fast_math… it didn't change anything =/
Off to use the profiler then.

OK, so the profiler was a no-go… it doesn't seem to like my code. It keeps crashing without displaying an error.