Little to no improvement with CUDA in Matlab: CUDA produces little acceleration of FFT computation

I finally got the MATLAB plug-in working on my computer, but the result is disappointing: CUDA produces little acceleration of the FFT computation, as shown below:

============================================
---- Run native Matlab simulations ----

which Szeta
C:\Matlab_CUDA-1.1a\Szeta.mexw64

tic; FS_2Dturb(128,1,1,1); toc;

CFL = 0.1017

Gsqav = 1.1995

Elapsed time is 4.856168 seconds.

tic; FS_vortex; toc;

ans = 512

Elapsed time is 24.564927 seconds.

---- Compile the CUDA source and rerun the simulations with acceleration ----

nvmex -f nvmexopts.bat Szeta.cu -IC:\cuda\include -LC:\cuda\lib64 -lcufft -lcudart
abdelali target arch: win64
Szeta.cu
tmpxft_00000dac_00000000-3_Szeta.cudafe1.gpu
tmpxft_00000dac_00000000-8_Szeta.cudafe2.gpu
tmpxft_00000dac_00000000-3_Szeta.cudafe1.cpp
which Szeta
C:\Matlab_CUDA-1.1a\Szeta.mexw64

tic; FS_2Dturb(128,1,1,1); toc;

CFL = 0.1017

Gsqav = 1.1995

Elapsed time is 4.834929 seconds.

tic; FS_vortex; toc;

ans = 512

Elapsed time is 23.179676 seconds.

The improvement, if there is any, is imperceptible (<6%). The computing environment is as follows:
Windows 7 Pro 64-bit
Matlab 2009a
VS2008 Pro
CUDA 2.3 (driver: cudadriver_2.3_winvista_64_190.38_general.exe)
Dell Precision workstation (CPU: Intel Xeon 3.3 GHz)
Quadro FX 3800
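
To put a number on the improvement quoted above, here is a small sketch in Python (not part of the MATLAB plug-in) that computes the relative change from the elapsed times in the transcript. The helper name `percent_saved` is made up for illustration; the timings are copied from the two runs, and the first pair assumes the first post-compile timing belongs to `FS_2Dturb(128,1,1,1)`.

```python
# Elapsed times in seconds, copied from the transcript above:
# (before compiling Szeta.cu, after compiling Szeta.cu).
# Assumption: the first post-compile timing is for FS_2Dturb(128,1,1,1).
timings = {
    "FS_2Dturb(128,1,1,1)": (4.856168, 4.834929),
    "FS_vortex": (24.564927, 23.179676),
}


def percent_saved(before, after):
    """Percentage of wall-clock time saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


for name, (before, after) in timings.items():
    saved = percent_saved(before, after)
    speedup = before / after
    print(f"{name}: {saved:.2f}% less time, speedup factor {speedup:.3f}")
```

By this measure FS_2Dturb saves well under 1% and FS_vortex a little under 6%. With only one run per case, some of that difference may just be run-to-run variation.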

Any suggestions on how to make CUDA work better?
[Attachment: speed_fft_results.png]

Just curious if you’ve tried out Jacket and, if so, how it worked for you?
-John