Increasing speed of fft2_cuda: I'm not getting the speeds I expected

I am trying to speed up a MATLAB simulation that is very fft2-heavy. I downloaded the MATLAB plug-in for CUDA and compiled the fft2_cuda, fft2_cuda_sp_dp and ifft2_cuda functions. The run time for a typical simulation has dropped from 1373.45 seconds to 1078.65 seconds (a reduction of about 21%), but I was hoping for more. Is there a way to speed up the calls to the GPU?

I am running MATLAB 7.2.0.232 (R2006a)
I am compiling with Microsoft Visual C/C++ version 8.0
My GPU is a GeForce 8800 GT