Get error cudaErrorLaunchOutOfResources from kernel after removing -G compiler option

I have an Nvidia GeForce GTX 1070 card in my desktop, running Windows 10 and Visual Studio 2017. I made a CUDA C++ project to do some math computation. The project builds a DLL that exports some functions, and these exported functions are called from my main (GUI) C++ project.

Both the debug and release versions of the DLL worked correctly until I changed “Device/Generate GPU Debug Information” on the CUDA C/C++ tab of the project properties from Yes (-g/-G) to No for both versions. I made the change because I noticed that the major GPU function exported from the DLL runs slower in the release version than in the debug version. The timing is measured on the main project side, so it is not a thread synchronization problem. After the change, both versions report the error cudaErrorLaunchOutOfResources when invoking the kernel function.

If the cause is that the kernel uses too many registers, why did it work before the setting change? I also notice that the option for Host/Optimization is for both versions. Do you know which option I should choose to get accurate, high-performance results?

The kernel function looks like this:
__global__ void ToolShootTriangleKernel(const tagTriIndices *d_pTriIndices, size_t triNum,
const tagXyz *d_pTriVertices, const tagXyz &d_toolCen, const tagToolSec *d_pToolShape,
int toolSecNum, tagAtomicVar *d_pToolCtrlZ)

It is called like this:
int threadsPerBlock = 512;
int blocksPerGrid = (int)((triNum + threadsPerBlock - 1) / threadsPerBlock);

ToolShootTriangleKernel <<<blocksPerGrid, threadsPerBlock>>> (d_pTriIndices,
triNum, d_pTriVertices, d_toolCen, d_pToolShape, toolSecNum, d_pToolCtrlZ);
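
For anyone hitting the same error: cudaErrorLaunchOutOfResources usually means the launch asks for more threads per block than the kernel's register count allows. A minimal sketch of how to check this is below. It assumes a stand-in kernel (DemoKernel is hypothetical, since the body of ToolShootTriangleKernel is not shown) and uses cudaFuncGetAttributes to print the register count and the largest block size the compiled kernel supports, then checks the launch with cudaGetLastError.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real ToolShootTriangleKernel body is not shown in the post.
__global__ void DemoKernel(float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * (float)i;
}

int main()
{
    const size_t n = 1 << 20;
    const int threadsPerBlock = 512;  // same block size as in the question
    const int blocksPerGrid = (int)((n + threadsPerBlock - 1) / threadsPerBlock);

    // Ask the runtime what the compiled kernel needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, DemoKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
    if (threadsPerBlock > attr.maxThreadsPerBlock)
        printf("a block of %d threads will not launch\n", threadsPerBlock);

    float *d_out = nullptr;
    err = cudaMalloc(&d_out, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    DemoKernel<<<blocksPerGrid, threadsPerBlock>>>(d_out, n);

    // Launch-configuration errors such as cudaErrorLaunchOutOfResources are reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Errors raised while the kernel runs are reported at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Running the same check against builds with and without -G should show whether the register count differs between them.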

Thanks in advance for any help.

Figured it out.

  1. “Register usage will also be affected if you are passing the -G switch to the compiler” (from https://stackoverflow.com/questions/17913630/counting-registers-thread-in-cuda-kernel). This explains why the kernel works when built with -G but not without it.
  2. “CUDA C/C++ – Max Used Register” cannot be set to 0 unless __launch_bounds__ is used in the code.
  3. Setting “Generate GPU Debug Information” to No doubles the speed of the release version.
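
For reference, here is a minimal sketch of the __launch_bounds__ qualifier mentioned in point 2. The kernel name, its body, and the THREADS_PER_BLOCK macro are hypothetical stand-ins, not the code from the project above. The qualifier tells the compiler the largest block size the kernel will be launched with, so it limits register use per thread to make a block of that size fit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 512  // assumed to match the block size used at launch

// Hypothetical kernel: __launch_bounds__ promises the compiler that blocks
// never exceed THREADS_PER_BLOCK threads, so registers are budgeted accordingly.
__global__ void __launch_bounds__(THREADS_PER_BLOCK)
BoundedKernel(float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * (float)i;
}

int main()
{
    const size_t n = 1 << 20;
    const int blocksPerGrid = (int)((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK);

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    BoundedKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(d_out, n);

    // Check both the launch itself and the kernel execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    printf("result: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Launching this kernel with more than THREADS_PER_BLOCK threads per block fails, so the macro and the launch configuration have to stay in sync. Capping registers this way can push values into local memory, so it is worth timing the result.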