CUDA build rule v4.0 vs. v3.0 for MS Visual Studio

Hi all,

I noticed something strange with CUDA build rule v4.0.
I originally wrote some code with CUDA 3.2, and it performs well.
It also works well with CUDA 4.0.

However, when I change only the CUDA build rule version from 3.0 to 4.0, performance drops significantly.
I also found that the two versions of the build rule generate different command lines.

From CUDA build rule v3.0.14
"C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" -gencode=arch=compute_13,code="sm_13,compute_13" -gencode=arch=compute_20,code="sm_20,compute_20" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -use_fast_math -I"C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include" -I"./" -I"…/…/common/inc" -I"…/…/…/shared/inc" -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\common\inc" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\common\inc" -maxrregcount=32 -gencode=arch=compute_13,code="sm_13,compute_13" -gencode=arch=compute_20,code="sm_20,compute_20" --compile -o "Release\kernels.cu.obj" "e:\Users\Administrator\Desktop\HPDQ\HPHC_PQ\kernels.cu"

From CUDA build rule v4.0
"C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" -gencode=arch=compute_13,code="sm_13,compute_13" -gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -use_fast_math -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.0\C\common\inc" -I"C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include" -maxrregcount=32 --compile -o "Release/HPHC_PQ.vcproj.obj" "e:\Users\Administrator\Desktop\HPDQ\HPHC_PQ\HPHC_PQ.vcproj"

It is hard for me to spot the main difference between the two command lines. I’d also like to know what causes the performance difference.
Could you help me? :)
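
For anyone who wants to compare the two command lines mechanically, here is a minimal Python sketch that splits each line into arguments and lists the ones unique to each build rule. It is only an illustration, not part of my project: the two strings are abbreviated versions of the command lines above (long paths and repeated options dropped, straight quotes assumed), and the names split_args and diff_args are made up for this example.

```python
def split_args(cmd):
    """Split a command line on whitespace, keeping double-quoted spans intact."""
    args, current, in_quotes = [], [], False
    for ch in cmd:
        if ch == '"':
            in_quotes = not in_quotes   # toggle on every double quote
            current.append(ch)
        elif ch.isspace() and not in_quotes:
            if current:                 # end of one argument
                args.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        args.append("".join(current))
    return args


def diff_args(cmd_a, cmd_b):
    """Return (arguments only in cmd_a, arguments only in cmd_b), order preserved."""
    a, b = split_args(cmd_a), split_args(cmd_b)
    return [x for x in a if x not in b], [x for x in b if x not in a]


# Abbreviated copies of the two command lines from the post (assumption:
# shortened by hand for readability, so this is not the full option list).
V3 = (r'nvcc.exe -gencode=arch=compute_13,code="sm_13,compute_13" '
      r'-gencode=arch=compute_20,code="sm_20,compute_20" -use_fast_math '
      r'-I"./" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -maxrregcount=32 '
      r'--compile -o "Release\kernels.cu.obj" '
      r'"e:\Users\Administrator\Desktop\HPDQ\HPHC_PQ\kernels.cu"')
V4 = (r'nvcc.exe -gencode=arch=compute_13,code="sm_13,compute_13" '
      r'-gencode=arch=compute_20,code="sm_20,compute_20" --machine 32 '
      r'-use_fast_math -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " '
      r'-maxrregcount=32 --compile -o "Release/HPHC_PQ.vcproj.obj" '
      r'"e:\Users\Administrator\Desktop\HPDQ\HPHC_PQ\HPHC_PQ.vcproj"')

only_in_v3, only_in_v4 = diff_args(V3, V4)
print("Only in v3.0.14:", only_in_v3)
print("Only in v4.0:   ", only_in_v4)
```

On these abbreviated inputs, the arguments unique to the v4.0 rule are --machine 32 and the different input and output file names, and the v3.0.14 rule has the extra -I"./" include path. Whether any of these explains the slowdown is exactly what I am unsure about.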