C++ implementation 10x slower than pyCUDA

Hey,
A couple of months ago I wrote a pyCUDA program for calculating an integral image. I now want to reuse the same kernel in a C++/CUDA project, but it runs 10x slower than before, even though I did not change any settings. The block and grid sizes are also the same. I did notice that the kernel uses 17 registers in the fast version and 22 in the slow one. I would expect nvcc to compile the code identically in both cases. Where am I going wrong?
I guess the answer can be found somewhere in the command line.
C++/CUDA (in VS 2015):

Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" --use-local-env --cl-version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -ID:\OpenCVBuildFiles\newBuild\install\x64\vc14…\include -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart none -o x64\Debug%(Filename)%(Extension).obj "%(FullPath)"

Runtime API (NVCC Compilation Type is hybrid object or .c file)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" --use-local-env --cl-version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -ID:\OpenCVBuildFiles\newBuild\install\x64\vc14…\include -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart none -g -Xcompiler "/EHsc /nologo /FS /Zi " -o x64\Debug%(Filename)%(Extension).obj "%(FullPath)"

CMD line array extracted from pyCUDA:
[['nvcc', '--cubin', '-arch', 'sm_61', '-m64', '-Ic:\users\ee\appdata\local\continuum\anaconda2\lib\site-packages\pycuda\cuda', 'kernel.cu'], 'c:\users\ee\appdata\local\temp\tmplzhkzg']

Thanks a lot for your help.

Get rid of -G and -g. You want a release build, not a debug build.
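
For context: -G generates debug information for device code and turns off device code optimizations, while -g does the same kind of thing for host code. If you want to see the effect yourself, a rough sketch along these lines should work. It reuses the flags from the pyCUDA command line in the question and adds -Xptxas -v so that ptxas prints the register usage of each kernel. The output file names kernel_release.cubin and kernel_debug.cubin are made up for illustration, and the sketch uses POSIX shell syntax (the nvcc flags themselves are the same on Windows).

```shell
#!/bin/sh
# Flags taken from the pyCUDA command line in the question.
COMMON_FLAGS="--cubin -arch sm_61 -m64"

# Optimized build. -Xptxas -v asks ptxas to report per-kernel register use.
# The output file name is a hypothetical choice.
RELEASE_CMD="nvcc $COMMON_FLAGS -Xptxas -v kernel.cu -o kernel_release.cubin"

# Same build with -G added: device debug info on, device optimizations off.
DEBUG_CMD="nvcc $COMMON_FLAGS -G -Xptxas -v kernel.cu -o kernel_debug.cubin"

# Print both command lines so they can be compared or copied.
echo "$RELEASE_CMD"
echo "$DEBUG_CMD"

# Only invoke the compiler when nvcc and the kernel source are present.
if command -v nvcc >/dev/null 2>&1 && [ -f kernel.cu ]; then
    $RELEASE_CMD
    $DEBUG_CMD
fi
```

Comparing the two ptxas reports should show whether the register count changes between the builds. In Visual Studio, switching the project to the Release configuration is usually the simplest way to get rid of both flags.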

Thank you. That fixed it. I didn’t expect the difference to be that large, though.