Why is my ldmatrix PTX instruction wrong?

It appears to me to be a code generation issue, i.e. a compiler problem. If we leave the break; statement commented out, and change the loop extent, I see varying behavior:

If I make it:

for (int i = 0; i < 2; i++)

I see the expected printout. If I make it:

for (int i = 0; i < 3; i++)

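I see all zeros. This was on a cc8.9 GPU with CUDA 12.2, and I can’t explain why that would be. compute-sanitizer reports no issues in either case. I also note that if I compile with -G and a loop extent of 3, I get the expected output rather than all zeros. Since -G disables most device code optimizations, that is consistent with a problem in the optimizer.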

My suggestion is to retest on CUDA 12.4 (i.e. latest CUDA version available at the moment) and if the issue persists, then file a bug.

If you do file a bug, my suggestion is to strip out anything unnecessary in the code. The compiler warns that there are various unused variables, and there are even unused functions.
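
As a rough idea of what a stripped-down test case could look like, here is a minimal sketch. It is not your code and I have not confirmed that it reproduces the all-zeros result; the kernel name, the run_repro helper, and the single 8x8 tile are all hypothetical choices made for illustration. It issues one ldmatrix (.x1) load inside a loop with a runtime extent and checks each lane's fragment on the host:

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical stripped-down test case (not the original poster's code):
// one warp loads a single 8x8 tile of 16-bit values with ldmatrix in a loop.
__global__ void ldmatrix_repro(int iters, float2 *out)
{
    __shared__ __align__(16) half tile[8 * 8];

    // Fill the tile with recognizable values 0..63.
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        tile[i] = __float2half(static_cast<float>(i));
    __syncthreads();

    // For .x1, lanes 0-7 supply the addresses of the eight 16-byte rows.
    // The other lanes' addresses are ignored, but we keep them valid.
    const half *row = &tile[(threadIdx.x % 8) * 8];
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(row));

    uint32_t frag = 0;
    for (int i = 0; i < iters; i++) {
        asm volatile("ldmatrix.sync.aligned.m8n8.x1.shared.b16 {%0}, [%1];"
                     : "=r"(frag)
                     : "r"(addr)
                     : "memory");
    }

    // Unpack the two 16-bit elements held by this lane.
    out[threadIdx.x] = make_float2(
        __half2float(__ushort_as_half(static_cast<unsigned short>(frag & 0xffffu))),
        __half2float(__ushort_as_half(static_cast<unsigned short>(frag >> 16))));
}

// Returns true if every lane holds the two elements expected from the
// m8n8 fragment layout: lane l gets row l/4, columns 2*(l%4) and 2*(l%4)+1,
// which for this tile are the values 2*l and 2*l+1.
bool run_repro(int iters)
{
    float2 *d_out = nullptr;
    float2 h_out[32] = {};
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

    ldmatrix_repro<<<1, 32>>>(iters, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }

    bool ok = true;
    for (int lane = 0; lane < 32; lane++) {
        if (h_out[lane].x != 2.0f * lane || h_out[lane].y != 2.0f * lane + 1.0f) {
            printf("lane %d: got %.1f %.1f\n", lane, h_out[lane].x, h_out[lane].y);
            ok = false;
        }
    }
    return ok;
}

int main()
{
    // Compare the two loop extents discussed above.
    printf("extent 2: %s\n", run_repro(2) ? "ok" : "MISMATCH");
    printf("extent 3: %s\n", run_repro(3) ? "ok" : "MISMATCH");
    return 0;
}
```

Assuming the file is saved as repro.cu, building it once with nvcc -arch=sm_89 and once with nvcc -arch=sm_89 -G gives the optimized and debug variants to compare. If a case this small behaves correctly, add pieces of your original kernel back one at a time until the failure returns; the smallest version that still fails is what belongs in the bug report.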