Kernel runs with NSight, cudaErrorLaunchFailure (4) without

I have a MEX function whose kernel runs correctly when the environment variable NSIGHT_CUDA_DEBUGGER = 1, whether or not I attach NSight to the MEX function.

But if I set NSIGHT_CUDA_DEBUGGER = 0 and run it, I get cudaErrorLaunchFailure (4), with or without Visual Studio debugging.

Why would a kernel run correctly with NSIGHT_CUDA_DEBUGGER = 1 and fail to launch with NSIGHT_CUDA_DEBUGGER = 0?

OK, it seems to be related to the shared memory size: the kernel was declared with less shared memory than it actually uses.

I still don’t know why it ran with NSIGHT_CUDA_DEBUGGER = 1.
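
For anyone hitting the same thing, here is a minimal sketch of how the shared memory size can be kept in step with the block size, and how the error can be caught right at the launch. It is not my actual MEX code: the kernel rotateInBlock and the helpers sharedBytesFor and launchAndCheck are made-up names for illustration, and I am assuming dynamic shared memory (the third launch parameter). Reading or writing past the end of shared memory is undefined behaviour, so one plausible reason for the difference is that the fault only shows up in some run configurations, but I have not confirmed that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores one float in shared memory,
// then reads its neighbour's value. It needs blockDim.x floats of shared memory.
__global__ void rotateInBlock(float *data)
{
    extern __shared__ float tile[];   // size comes from the 3rd launch parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[(threadIdx.x + 1) % blockDim.x];
}

// Bytes of dynamic shared memory the kernel above needs for a given block size.
size_t sharedBytesFor(int threadsPerBlock)
{
    return static_cast<size_t>(threadsPerBlock) * sizeof(float);
}

// Launches the kernel and returns the first error seen, either at launch
// or while the kernel was running.
cudaError_t launchAndCheck(float *d_data, int blocks, int threads, size_t sharedBytes)
{
    rotateInBlock<<<blocks, threads, sharedBytes>>>(d_data);
    cudaError_t err = cudaGetLastError();   // bad launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // faults raised during execution
}

int main()
{
    const int blocks = 4, threads = 256;
    const size_t bytes = static_cast<size_t>(blocks) * threads * sizeof(float);

    float *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(d_data, 0, bytes);

    // Derive the shared memory size from the block size instead of hard-coding it.
    err = launchAndCheck(d_data, blocks, threads, sharedBytesFor(threads));
    printf("kernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Passing a smaller value than sharedBytesFor(threads) as the third launch parameter reproduces the kind of mistake I made. Checking cudaGetLastError() and cudaDeviceSynchronize() right after the launch at least ties the error to the kernel that caused it.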