CUDA Kernel doesn't execute all threads, stops after the 640th thread

This is my code to extract the diagonal from a COO format sparse matrix:

__global__ void diagonal_inverse_coo_kernel(int* in_row, int* in_col, double* in_value, double* out_value, int nonzeros, int* out_row, int* out_col) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < nonzeros) { // Ensure we don't go out of bounds
        if (in_row[idx] == in_col[idx]) {
            int diag_idx = in_row[idx];    // The diagonal index is the row index (or the column index; they are equal on the diagonal)
            out_row[diag_idx] = diag_idx;  // Store the diagonal position in out_row
            out_col[diag_idx] = diag_idx;  // Store the diagonal position in out_col
            out_value[diag_idx] = 1.0 / in_value[idx];  // Store the inverted diagonal value directly (D^-1)
        }
    }
}

I initialize the kernel with these parameters:

// CUDA kernel configuration parameters
dim3 blockSize(1024);
dim3 gridSize((matrixL_host->nonzeros + blockSize.x - 1) / blockSize.x);
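
(For illustration: with, say, nonzeros = 5000, this evaluates to gridSize = (5000 + 1023) / 1024 = 5 blocks, i.e. 5120 threads in total; the value 5000 here is just an example.)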

The code always stops after the 640th thread, leaving all other entries as 0, regardless of the matrix size. Am I making a mistake with my initialization? Have I misunderstood block sizes and grid sizes, or how to handle the kernel-internal index? If so, why is 640 the relevant number in this case?

It is easier to help when people post a complete minimal reproducer rather than code snippets. You are sitting in front of the complete code and can instrument it and run it in the debugger. We cannot do that based on snippets.

What exactly do you mean by “stops after the 640th thread”? Clearly there are multiple data-driven guards restricting which outputs are written, so you would want to examine the relevant input data. It also seems possible that there are race conditions where multiple threads write to the same element of the output arrays (at least I cannot convince myself at a glance that this cannot happen).
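
For example, if the COO input happened to contain two entries at the same diagonal position (say, two entries with row == col == 5), two threads would both write to out_value[5], out_row[5], and out_col[5] with no ordering guarantee.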

I apologise for the limited code, but I have created a complex numerical solver, and a minimal working example would be several hundred lines long just to extract the data from a COO matrix file and convert it into a format the GPU can work with. I am not sure how to create a minimal working example for something so complex, though I am open to suggestions if you can instruct me on how to do so. I was hoping this was an obvious issue. Also, the debugger is of no help here: no errors arise, and it is the lack of data/threads being created that is the issue.

I mean that the first 640 values are properly extracted from the input matrix and stored to the output matrix, and after that the GPU just stops. It no longer creates any threads to read the rest. They don’t save a zero to the output; the code simply does not create any threads after the 640th.

Maybe nonzeros is 640.
Maybe your input sparse matrix doesn’t have any nonzero elements on the diagonal after row/col 640.

The code is certainly creating threads after 640, if it is creating any threads at all.

General debug suggestions might be useful. Put a printf statement in your kernel that prints out any time a value of 640 or greater is indicated for diag_idx. Use proper CUDA error checking. Run your code with compute-sanitizer.
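
For reference, proper error checking after a launch might look something like this. CUDA_CHECK is just an illustrative macro name (not a library facility), and the d_* pointer names are placeholders for your device allocations:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro; any equivalent wrapper will do.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

diagonal_inverse_coo_kernel<<<gridSize, blockSize>>>(d_in_row, d_in_col, d_in_value,
                                                     d_out_value, nonzeros,
                                                     d_out_row, d_out_col);
CUDA_CHECK(cudaGetLastError());      // catches invalid launch configurations
CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during kernel execution

Running the built executable under compute-sanitizer is then just compute-sanitizer ./your_app.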

I have tested with multiple datasets of many sizes, both real-world and artificially created within my code for testing purposes. Any matrix of 640x640 or smaller works fine. Any matrix with greater dimensions simply stops working, because the kernel does not launch more than 640 threads. Is this an issue with my blockSize and gridSize variables?

I have added the print statements; that’s how I know only 640 threads are being created. It always stops after the 640th thread.

What should I look for in the compute-sanitizer output? It always runs through my code with 0 errors. I’m fairly certain the issue is in how I am launching the kernel.

If compute-sanitizer doesn’t report any issues, then you can disregard it as a debug suggestion. The thing to look for is error reports.

If you are convinced that there are no threads after the thread numbered with an idx of 640, then you could put a printf in your kernel, before the if statement, that prints any time an idx value of 640 or higher is generated. That would help to clarify that notion.
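
Concretely, a sketch of that diagnostic, placed before the bounds check so it fires regardless of the data (device-side printf needs <cstdio> included):

int idx = blockDim.x * blockIdx.x + threadIdx.x;

// Diagnostic only: prints for every launched thread with idx >= 640,
// independent of any data-driven condition below.
if (idx >= 640) printf("thread idx %d exists\n", idx);

if (idx < nonzeros) {
    // ... existing kernel body ...
}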

Your block size variable is 1024. If you are getting a successful launch at all, you are getting at least 1024 threads. But due to the nature of sparse matrices and the COO description, simply having 1024 threads may not be enough to give you all the non-zero diagonal values. I would assume that if you launch your kernel with enough threads to cover the total number of non-zero values, that should be enough to extract the diagonal.
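
As an aside, a grid-stride loop is a common pattern that makes a kernel’s correctness independent of how many threads you launch; the body of your kernel, wrapped that way, would look roughly like this (same parameters assumed):

// Each thread processes idx, idx + stride, idx + 2*stride, ... so any
// grid size covers all nonzeros entries.
int stride = gridDim.x * blockDim.x;
for (int idx = blockDim.x * blockIdx.x + threadIdx.x; idx < nonzeros; idx += stride) {
    if (in_row[idx] == in_col[idx]) {
        int diag_idx = in_row[idx];
        out_row[diag_idx] = diag_idx;
        out_col[diag_idx] = diag_idx;
        out_value[diag_idx] = 1.0 / in_value[idx];
    }
}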

Your grid and block calculations don’t appear to be obviously broken to me, nor do they indicate anything obvious that would lead to a number of 640. The grid calculation, at least, depends on numerical values that I don’t have; it’s possible that those values could be a problem. If compute-sanitizer reports no trouble, and the launch of this kernel in your host code is actually being reached during execution, then you are certainly getting at least one block launched, based on what I can see.
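
If in doubt, printing the launch parameters on the host right before the launch settles the arithmetic (names taken from your snippet; this assumes nonzeros is an int):

printf("nonzeros = %d, gridSize.x = %u, blockSize.x = %u, total threads = %u\n",
       matrixL_host->nonzeros, gridSize.x, blockSize.x,
       gridSize.x * blockSize.x);

If the total thread count printed there is at least nonzeros, then the launch configuration is not what is limiting you.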

Stepping through code and inspecting data are key functionalities of debuggers, so I don’t know how one would arrive at such an assessment. Obviously one can also instrument the code and log the resulting output, which is my preferred method of debugging. Yes, I know it is old-fashioned and quaint, but it works and has helped me successfully debug issues where others had failed.

Thank you for the suggestion with the print statement. That extremely obvious thing to do somehow eluded me; I continually made the mistake of printing within the if statement, and I feel foolish for that. You were correct: it turns out that the kernel is launching enough threads for the number of nonzero elements within the matrix. Due to my misplacement of the print statement, I had misinterpreted this to mean that the kernel did not launch enough threads.

Moving the print statement allowed me to properly diagnose the problem. Thank you all for your help, and I apologise for having wasted your time with my own inadequate coding skills!
