CUDA Kernel doesn't execute all threads, stops after the 640th thread

Maybe nonzeros is 640?
Maybe your input sparse matrix doesn't have any nonzero elements on the diagonal after row/column 640.

The code is certainly creating threads after 640, if it is creating any threads at all: a kernel launch either creates every thread in the grid or fails outright.

Some general debugging suggestions may be useful. Put a printf statement in your kernel that prints whenever diag_idx takes a value of 640 or greater. Use proper CUDA error checking. Run your code with compute-sanitizer.
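As a rough sketch of the first two suggestions, something like the following could work. Your kernel isn't shown here, so probe_kernel, CUDA_CHECK, the problem size and the block size are all placeholders I made up, and diag_idx is simply taken to be the global thread index; in your code it would be whatever your kernel actually computes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line info if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder for the real kernel. Assumption: one thread per row, with
// diag_idx equal to the global thread index.
__global__ void probe_kernel(int n)
{
    int diag_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (diag_idx >= n) return;

    // Report every thread at or beyond the point where execution seems to stop.
    if (diag_idx >= 640)
        printf("thread running: diag_idx = %d\n", diag_idx);
}

int main()
{
    const int n = 1024;       // assumed problem size
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;

    probe_kernel<<<blocks, threads>>>(n);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports execution errors, flushes printf
    return 0;
}
```

If the lines for diag_idx >= 640 show up, the threads are running and the problem is more likely in the data (for example, no diagonal nonzeros past that row). To check for out-of-bounds accesses, build as usual and run the executable under the tool, e.g. compute-sanitizer ./your_app (the executable name is a placeholder).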