[NVCC BUG] Kernel seems to silently fail

I have been experimenting with the matrix multiplication example from CUDA 10.2. I found that if I add variables such as int ty2 = ty + 2 * blockDim.y;
and then use them at least once, for example
As[ty2][tx] = A[a + wA * (ty + 2 * blockDim.y) + tx]; instead of
As[ty + 2 * blockDim.y][tx] = A[a + wA * (ty + 2 * blockDim.y) + tx];
the kernel appears to do nothing: cudaStreamSynchronize returns no error, and the target device memory (d_C) stays zero.
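One plausible reason the synchronize call reports nothing is that launch-time failures are normally picked up with cudaGetLastError() right after the launch, not at the later synchronization. The sketch below shows that checking pattern. The kernel fillOnes and the helper launchAndCheck are hypothetical stand-ins, not code from the sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the sample's matrix multiplication kernel.
__global__ void fillOnes(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// Launches the stand-in kernel and returns the first error seen.
// The launch itself is checked before waiting on the stream.
cudaError_t launchAndCheck(float *d_out, int n, int threadsPerBlock,
                           cudaStream_t stream) {
    if (threadsPerBlock <= 0) return cudaErrorInvalidValue;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fillOnes<<<blocks, threadsPerBlock, 0, stream>>>(d_out, n);

    // Launch-time failures (invalid configuration, too many resources
    // requested) are reported here, immediately after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs surface at synchronization.
    err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        fprintf(stderr, "execution failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Adding the same cudaGetLastError() check after the matrix multiplication launch in the sample should show whether the launch is being rejected outside the debugger as well.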

The problem disappears in debug mode, with optimization disabled, with a different thread block size (command-line -b=16), or with maxregcount=48.
Also, when the program is run under Nsight Visual Studio Edition CUDA Debugging (Next-Gen), it reports a cudaErrorLaunchOutOfResources (701) error.
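The maxregcount workaround caps registers for every kernel in the compilation unit. A per-kernel alternative is the __launch_bounds__ qualifier, which tells the compiler the largest block size the kernel will be launched with so that it can limit register use accordingly. The following is a minimal sketch, assuming a 32x32 block; the kernel scaleTile, the constant kBlockSize, and the helper maxLaunchableThreads are hypothetical and only illustrate the qualifier.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 32;  // assumed tile width, not taken from the post

// __launch_bounds__ states that this kernel is never launched with more than
// kBlockSize * kBlockSize threads per block, so the compiler restricts
// register use such that a launch of that size can succeed.
__global__ void __launch_bounds__(kBlockSize * kBlockSize)
scaleTile(float *data, int width, int height, float factor) {
    int x = blockIdx.x * kBlockSize + threadIdx.x;
    int y = blockIdx.y * kBlockSize + threadIdx.y;
    if (x < width && y < height) {
        data[y * width + x] *= factor;
    }
}

// Returns the largest block size the runtime reports as launchable for
// scaleTile on the current device, or -1 if the query fails.
int maxLaunchableThreads() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, (const void *)scaleTile) != cudaSuccess) {
        return -1;
    }
    return attr.maxThreadsPerBlock;
}
```

Whether this avoids the failure in the modified sample is untested here; it only expresses the same constraint as maxregcount at the level of a single kernel.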

Windows 10.0.19041, NVIDIA driver DCH 462.31, GeForce RTX 2060 6 GB

Attachment: matrixMul_vs2019.zip (172.2 KB)

Do you mean the issue happens only in Release mode with default optimization, but the program passes in Debug mode with optimization off? cudaErrorLaunchOutOfResources (701) indicates either too many arguments to the device kernel, or a kernel launch that specifies too many threads for the kernel's register count.
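The register-count explanation can be checked with simple arithmetic: registers per thread times threads per block must not exceed the device's per-block register limit. Assuming a limit of 65536 registers per block (a value that should be queried, not trusted) and a 32x32 = 1024-thread block (also an assumption; the post only says -b=16 avoids the problem), each thread may use at most 64 registers. That would be consistent with maxregcount=48 and a 16x16 block both making the failure go away. The helpers below are hypothetical names for a sketch of that check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pure arithmetic: does a block of this size fit in the per-block register
// budget? Ignores register allocation granularity, so it is optimistic.
bool fitsRegisterBudget(int regsPerThread, int threadsPerBlock,
                        int regsPerBlock) {
    return (long long)regsPerThread * threadsPerBlock <= regsPerBlock;
}

// Queries the compiled kernel and the device, prints the numbers, and
// reports whether the requested block size looks launchable.
// `kernel` is the kernel's host-side address, e.g. (const void *)myKernel.
bool reportRegisterBudget(const void *kernel, int threadsPerBlock,
                          int device = 0) {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    printf("registers per thread: %d\n", attr.numRegs);
    printf("registers per block (device limit): %d\n", prop.regsPerBlock);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);

    return threadsPerBlock <= attr.maxThreadsPerBlock &&
           fitsRegisterBudget(attr.numRegs, threadsPerBlock,
                              prop.regsPerBlock);
}
```

Calling reportRegisterBudget on the Release build of the modified kernel with the failing block size would show whether its register count really exceeds the budget.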

Yes, for me it reproduces only in Release with max registers set to 0 and default optimization.