Compiled matrixMul is slower than the binary included with the CUDA 5.0 SDK

I compiled the provided matrixMul example into a 64-bit exe using VS2010 C++ Express with Windows SDK 7.1, but my compiled version is around 11.5 times slower than the included exe. I used the profiler, and the problem appears to be DRAM utilization. Does anyone know what would cause this? I’m running Windows 7 64-bit.

Included exe:
run time = 4.653 msec
DRAM Utilization = 6.3% (1.73 GB/s)

My compiled exe:
run time = 53.543 msec
DRAM Utilization = 0.5% (152.12 MB/s)

Note: I also compiled a 32-bit version, and there is a similarly large performance gap between my 32-bit binary and the included 32-bit binary.

You didn’t state whether you chose Debug or Release. You will definitely want a Release build.

I was using Debug. As soon as I compiled as Release, the performance was what I expected. Thanks for the help!!
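
For anyone building from the command line instead of Visual Studio: the Debug/Release gap in this thread most likely comes down to nvcc's device-debug flag. `-G` generates debug information for device code and turns off device code optimization, which would be consistent with a slowdown of this size. Below is a rough sketch of the two kinds of build, not the exact command lines the Visual Studio project uses (those are not shown in this thread). It assumes matrixMul.cu is in the current directory, and SAMPLES_INC is a hypothetical placeholder for wherever the samples' shared helper headers live on your machine. It is written in POSIX shell syntax; at a Windows command prompt you would write %SAMPLES_INC% instead.

```shell
# SAMPLES_INC is a hypothetical placeholder, not a real SDK variable.
# Point it at the directory holding the samples' shared helper headers.

# Debug-style build: -g adds host debug info, -G adds device debug info
# and disables device code optimization, so kernels run much slower.
nvcc -g -G -I"$SAMPLES_INC" -o matrixMul_debug matrixMul.cu

# Release-style build: no -G, so device code is optimized as usual;
# -O2 sets the optimization level for the host code.
nvcc -O2 -I"$SAMPLES_INC" -o matrixMul_release matrixMul.cu
```

Running both binaries and comparing the reported run times should show whether the difference matches the gap described above.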