Maybe nonzeros is 640.
Or maybe your input sparse matrix doesn’t have any nonzero elements on the diagonal after row/column 640.
The code is certainly creating threads after 640, if it is creating any threads at all.
General debug suggestions might be useful. Put a printf statement in your kernel that prints out any time a value of 640 or greater is indicated for diag_idx. Use proper CUDA error checking. Run your code with compute-sanitizer.
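
Here is a minimal sketch of those suggestions. I haven't seen your kernel, so everything here is a stand-in: probe_kernel, count_threads_at_or_above, the CUDA_CHECK macro, the one-thread-per-diag_idx indexing and the early exit on nonzeros are all my assumptions, not your code. It only shows the shape of the printf test plus error checking after the launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Stand-in kernel (assumption): one thread per diagonal index, and threads
// with diag_idx >= nonzeros exit early, as many sparse kernels do.
__global__ void probe_kernel(int nonzeros, int threshold, int *hits)
{
    int diag_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (diag_idx >= nonzeros) return;

    if (diag_idx >= threshold) {
        // The suggested debug print: fires for every index at or past threshold.
        printf("thread working on diag_idx = %d\n", diag_idx);
        atomicAdd(hits, 1);  // also count the hits so the host can check
    }
}

// Launches the probe and returns how many threads did work at or beyond
// `threshold`. Every runtime call and the launch itself are checked.
int count_threads_at_or_above(int nonzeros, int threshold)
{
    int *d_hits = nullptr;
    int h_hits = 0;
    CUDA_CHECK(cudaMalloc(&d_hits, sizeof(int)));
    CUDA_CHECK(cudaMemset(d_hits, 0, sizeof(int)));

    const int block = 128;
    const int grid = (nonzeros + block - 1) / block;
    if (grid > 0) {
        probe_kernel<<<grid, block>>>(nonzeros, threshold, d_hits);
        CUDA_CHECK(cudaGetLastError());       // catches launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
    }

    CUDA_CHECK(cudaMemcpy(&h_hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_hits));
    return h_hits;
}
```

If count_threads_at_or_above(nonzeros, 640) comes back 0 and nothing is printed, then under these assumptions no thread got past the bounds check for index 640 or higher, which points back at the value of nonzeros. Once that is built into an executable (say ./app, a placeholder name), running compute-sanitizer ./app will report any out-of-bounds accesses the kernel makes.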