I have been experimenting with the CUDA 10.2 matrix multiplication sample (matrixMul). I found that if I add a variable such as
int ty2 = ty + 2 * blockDim.y;
and then use it at least once, for example
As[ty2][tx] = A[a + wA * (ty + 2 * blockDim.y) + tx];
instead of
As[ty + 2 * blockDim.y][tx] = A[a + wA * (ty + 2 * blockDim.y) + tx];
the kernel seems to do nothing: no error is returned by cudaStreamSynchronize, and the target device memory (d_C) stays zero.
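As far as I understand, a launch that is rejected because of its configuration is normally reported by cudaGetLastError() immediately after the launch, and not necessarily by a later synchronize call, which might be why nothing shows up here. A minimal sketch of that check follows; dummyKernel and the buffer size are hypothetical stand-ins, not the code from the attached project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the modified matrixMul kernel.
__global__ void dummyKernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    const int n = 1024;  // illustrative size: one block of 1024 threads
    float *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_out, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    dummyKernel<<<1, n>>>(d_out);

    // Check for a rejected launch (bad configuration, too many resources).
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }

    // Check for errors raised while the kernel was running.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("execution failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return 0;
}
```

With a check like this placed directly after the kernel launch in the sample, a rejected launch should print an error string instead of silently leaving d_C at zero.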
The problem disappears in a debug build, with optimization disabled, with a different thread block size (command-line option -b=16), or with maxregcount=48.
Also, if I run the program under Nsight Visual Studio Edition CUDA Debugging (Next-Gen), it reports a cudaErrorLaunchOutOfResources (701) error.
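Error 701 means the launch requested more resources than the device can provide, and the problem goes away with a smaller block or a register cap, so register usage per block looks like a plausible cause. The sketch below is one way to check this. It assumes a 32x32 thread block (an assumption; the post only states that -b=16 avoids the problem) and uses a hypothetical dummyKernel in place of the real kernel. cudaFuncGetAttributes reports the registers per thread the compiler assigned and the largest block size the kernel can be launched with.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in; substitute the kernel you want to inspect.
__global__ void dummyKernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    // Assumed block shape (32x32); not confirmed by the original post.
    const int threadsPerBlock = 32 * 32;

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, dummyKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Rough estimate only: the hardware allocates registers in chunks,
    // so the true requirement can be somewhat higher than this product.
    long long regsNeeded = (long long)attr.numRegs * threadsPerBlock;

    printf("registers per thread:        %d\n", attr.numRegs);
    printf("registers per block (est.):  %lld\n", regsNeeded);
    printf("device registers per block:  %d\n", prop.regsPerBlock);
    printf("max threads/block (kernel):  %d\n", attr.maxThreadsPerBlock);

    // attr.maxThreadsPerBlock already accounts for the kernel's resource use.
    if (threadsPerBlock > attr.maxThreadsPerBlock) {
        printf("a %d-thread block exceeds what this kernel can launch with\n",
               threadsPerBlock);
    }
    return 0;
}
```

If the reported maximum threads per block for the real kernel is below 1024 in the optimized build, that would be consistent with the launch being rejected, and with maxregcount=48 or a 16x16 block avoiding the problem.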
Windows 10.0.19041, NVIDIA driver DCH 462.31, GeForce RTX 2060 6 GB
Attachment: matrixMul_vs2019.zip (172.2 KB)