CUDA: the launch timed out and was terminated

Launch timeouts normally occur because the kernel is taking too long to run on a GPU which has an active display. The display driver's watchdog timer will kill any kernel that takes more than a few seconds to complete. Commenting out that line allows the kernel to complete without the timeout because, without the global memory writes, the kernel has no observable result, so most of its code is removed by compiler optimisation, leaving you with an essentially empty kernel.
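To confirm that the watchdog is what you are hitting, something along these lines may help. It is only a sketch: the helper names `watchdogEnabled` and `lastLaunchTimedOut` are made up for illustration, while `kernelExecTimeoutEnabled` and `cudaErrorLaunchTimeout` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the run time limit applies to `device`,
// 0 if it does not, and -1 if the device properties could not be queried.
int watchdogEnabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

// Hypothetical helper: waits for outstanding kernels to finish and reports
// whether the error returned is the launch timeout described above.
bool lastLaunchTimedOut()
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    return err == cudaErrorLaunchTimeout;
}
```

Calling `watchdogEnabled(0)` before your launch tells you whether device 0 is subject to the limit, and calling `lastLaunchTimedOut()` after the launch tells you whether that kernel was the one that got killed.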

The solution is to reduce the kernel execution time, either by doing less work per kernel call, by improving the code efficiency, or by some combination of both. The other alternative is to use a dedicated compute card (one with no display attached), which eliminates the display driver time limit altogether.
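As a rough illustration of "less work per kernel call", one long launch can be split into several shorter ones that each handle a slice of the data. The kernel `processChunk`, the wrapper `processInChunks`, and the arithmetic inside are placeholders rather than your actual code, and the right chunk size will depend on your kernel and GPU.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder kernel: works on elements [offset, offset + count) only.
__global__ void processChunk(float *data, size_t offset, size_t count)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) {
        data[offset + i] = 2.0f * data[offset + i] + 1.0f;
    }
}

// Covers all n elements using launches of at most `chunk` elements each,
// so that no single kernel runs long enough to trip the watchdog.
cudaError_t processInChunks(float *d_data, size_t n, size_t chunk)
{
    const unsigned int threads = 256;
    if (chunk == 0) {
        return cudaErrorInvalidValue;
    }
    for (size_t offset = 0; offset < n; offset += chunk) {
        size_t count = (n - offset < chunk) ? (n - offset) : chunk;
        unsigned int blocks = (unsigned int)((count + threads - 1) / threads);

        processChunk<<<blocks, threads>>>(d_data, offset, count);

        // Catch launch configuration errors first, then wait for this chunk
        // so any failure is reported for the chunk that caused it.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            return err;
        }
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            return err;
        }
    }
    return cudaSuccess;
}
```

This only works as written when elements can be processed independently; if your kernel has dependencies between elements, the split has to follow those dependencies instead.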