Issues with CUDA 2.3 + VS 2008 + MATLAB 2008a + Vista x64

I’m running:
Quadro FX 5800 (4 GB memory, 240 cores)
Core 2 Duo 3 GHz, 8 GB RAM
CUDA 2.3
VS 2008
MATLAB 2008a
Vista x64

I installed and compiled the following: http://developer.nvidia.com/object/matlab_cuda.html

The readme file reports the following timing:

tic; FS_vortex; toc;
ans =
512
Elapsed time is 15.164892 seconds.

With the CPU only (Szeta.m), I get:

tic; FS_vortex; toc;
ans =
512
Elapsed time is 50.030497 seconds.

With the GPU enabled, I get:
ans =
512
Elapsed time is 32.406742 seconds.

My GPU run is about 2x slower than the readme's timing (32.4 s vs. 15.2 s), even though this should be a top-of-the-line graphics card. Any ideas what is causing the performance hit?

Also, I am trying to run my own MATLAB MEX files with CUDA.

I wrote test.cu and mytest.cpp and compiled them with:
nvmex -f nvmexopts.bat mytest.cpp test.cu -IC:\cuda\include -LC:\cuda\lib64 -lcufft -lcudart

and ran it:

a=mytest([6 3 2 2 1], [5 6 8 2 5])

a =
1.0e+020 *
-8.6667 -7.5077 -6.3486 -5.5463 -4.9667

Basically it returns garbage.

However, if I change test.cu to copy Ad back instead:
cudaMemcpy(C,Ad,5*sizeof(double),cudaMemcpyDeviceToHost);

it returns:
a =
6 3 2 2 1

This indicates that the data is copied into CUDA memory and back again correctly.

Only the kernel call vecAdd<<<1,5>>>(Ad,Bd,Cd); doesn’t work.

I changed vecAdd to:
if (i < 5)
C[i]=6;

and it still returns garbage.

Does anyone have any ideas what’s going wrong? The compiler doesn’t report any errors and I don’t see anything wrong with the code. It just seems to return garbage results.
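
One way to narrow this down would be to check the return code of every CUDA call and to query the launch status right after the kernel. The sketch below is only an illustration under assumptions: the kernel body is a stand-in for what test.cu presumably contains (the attachment is not reproduced here), and the names addOnGpu and failed are hypothetical helpers, not part of CUDA or of the original files. Inside a MEX file, mexPrintf would likely be the better choice than printf. A separate thing worth checking, again as an assumption rather than a diagnosis: with CUDA 2.3, nvcc targets compute capability 1.0 by default and demotes double to float in device code, so a kernel using double may need -arch sm_13 on this card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed stand-in for the kernel in test.cu: element-wise C = A + B.
__global__ void vecAdd(const double *A, const double *B, double *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

// Hypothetical helper: reports a failing CUDA call and returns true on failure.
static bool failed(cudaError_t err, const char *what)
{
    if (err == cudaSuccess)
        return false;
    printf("%s failed: %s\n", what, cudaGetErrorString(err));
    return true;
}

// Hypothetical wrapper: returns 0 on success, 1 if any step reported an error.
int addOnGpu(const double *A, const double *B, double *C, int n)
{
    double *Ad = 0, *Bd = 0, *Cd = 0;
    size_t bytes = n * sizeof(double);

    // Short-circuit evaluation stops at the first failing call.
    bool bad = failed(cudaMalloc((void **)&Ad, bytes), "cudaMalloc Ad")
            || failed(cudaMalloc((void **)&Bd, bytes), "cudaMalloc Bd")
            || failed(cudaMalloc((void **)&Cd, bytes), "cudaMalloc Cd")
            || failed(cudaMemcpy(Ad, A, bytes, cudaMemcpyHostToDevice), "copy A to device")
            || failed(cudaMemcpy(Bd, B, bytes, cudaMemcpyHostToDevice), "copy B to device");

    if (!bad) {
        vecAdd<<<1, n>>>(Ad, Bd, Cd, n);
        // cudaGetLastError reports launch problems (bad configuration, no usable
        // kernel image). The blocking cudaMemcpy that follows waits for the kernel
        // and reports errors raised while it ran.
        bad = failed(cudaGetLastError(), "vecAdd launch")
           || failed(cudaMemcpy(C, Cd, bytes, cudaMemcpyDeviceToHost), "copy C to host");
    }

    // cudaFree on a null pointer is a no-op, so this is safe after a partial failure.
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
    return bad ? 1 : 0;
}
```

If every call reports success and the result is still wrong, that would point away from a failed launch and toward a mismatch in how the data is interpreted on the device.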

I can compile all the NVIDIA examples fine in Visual Studio, it seems.
mytest.cpp (958 Bytes)
test.cu (687 Bytes)