I have an Nvidia GeForce GTX 1070 card in my desktop, running Windows 10 and Visual Studio 2017. I made a CUDA C++ project to do some math computation. The project generates a DLL that exports some functions, and these exported functions are called from my main (GUI) C++ project.

Both the debug and release builds of the DLL worked correctly until I changed "Device / Generate GPU Debug Information" on the CUDA C/C++ tab of the project properties from Yes (-g/-G) to No for both configurations. I did this because I noticed that the main GPU function exported from the DLL runs slower in the release build than in the debug build (the timing is measured on the main project side, so it is not a thread-synchronization problem). After that change, however, both builds report cudaErrorLaunchOutOfResources when invoking the kernel function.

If the cause is that the kernel call uses too many registers, why did it work before the setting change? I also notice there is a Host/Optimization option for both configurations. Do you know which setting I should choose to get accurate, high-performance results?
The kernel function looks like this:
__global__ void ToolShootTriangleKernel(const tagTriIndices *d_pTriIndices, size_t triNum,
    const tagXyz *d_pTriVertices, const tagXyz &d_toolCen, const tagToolSec *d_pToolShape,
    int toolSecNum, tagAtomicVar *d_pToolCtrlZ)
It is called like this:
int threadsPerBlock = 512;
int blocksPerGrid = (int)((triNum + threadsPerBlock - 1) / threadsPerBlock);
ToolShootTriangleKernel <<<blocksPerGrid, threadsPerBlock>>> (d_pTriIndices,
triNum, d_pTriVertices, d_toolCen, d_pToolShape, toolSecNum, d_pToolCtrlZ);
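For reference, here is a sketch of how I can inspect the kernel's register usage and check the launch result (assuming the kernel signature above; error handling abbreviated). The launch fails with cudaErrorLaunchOutOfResources when registers-per-thread times threads-per-block exceeds the registers available per block on the device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

void CheckKernelResources()
{
    // Query the per-thread register count and static shared memory
    // that the compiled kernel requires.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, ToolShootTriangleKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);

    // Query what the device can supply per block.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("registers available per block: %d\n", prop.regsPerBlock);

    // With threadsPerBlock = 512, the kernel must use at most
    // prop.regsPerBlock / 512 registers per thread
    // (64 on a GTX 1070, which has 65536 registers per block).

    // ... launch the kernel as above, then check for launch errors:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
}
```

Compiling with -Xptxas -v also makes the compiler print the register count per kernel at build time.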
Thanks in advance for any help.