cusparseSpSV_solve function creates INF values

This is my function for a forward-backward substitution solver. Unfortunately, the cusparseSpSV_solve calls set the first value of the resulting vector to INF. Curiously, this only happens with matrices of dimensions 641x641 or greater. Does anyone know why only the first value is affected, and why this only happens above a certain matrix size?

void triangular_solve(cusparseHandle_t cusparseHandle, cusparseConstSpMatDescr_t matL, cusparseDnVecDescr_t vecZ, cusparseDnVecDescr_t vecRes, cusparseDnVecDescr_t vecTemp_a) {
    size_t bufferSize = 0;  // Buffer size and buffer allocation
    void* buffer = nullptr;
    double alpha = 1.0;     // Scaling factor (usually set to 1.0)

    // Forward solve: L * a = r
    // Step 1: Create SpSV descriptor
    cusparseSpSVDescr_t spSVDescr = NULL;
    cusparseSpSV_createDescr(&spSVDescr);

    // Step 2: Query buffer size for forward solve (L * a = r)
    cusparseSpSV_bufferSize(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,  // opA
        &alpha,
        matL,        // matrix A
        vecRes,      // vector r
        vecTemp_a,   // vector a (output)
        CUDA_R_64F,  // double precision
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        &bufferSize);

    // Step 3: Allocate memory for buffer
    cudaMalloc(&buffer, bufferSize);

    // Step 4: Perform analysis for forward solve
    cusparseSpSV_analysis(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha,
        matL,
        vecRes,
        vecTemp_a,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        buffer);

    // Step 5: Solve L * a = r (forward solve)
    cusparseSpSV_solve(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha,
        matL,
        vecRes,
        vecTemp_a,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr);

    // Backward solve: L^T * z = a
    // Step 6: Query buffer size for backward solve (L^T * z = a)
    cusparseSpSV_bufferSize(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL,       // matrix A
        vecTemp_a,  // vector a (input)
        vecZ,       // vector z (output)
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        &bufferSize);

    // Step 7: Reallocate buffer (the backward solve may need a different size)
    cudaFree(buffer);
    cudaMalloc(&buffer, bufferSize);

    // Step 8: Perform analysis for backward solve
    cusparseSpSV_analysis(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL,
        vecTemp_a,
        vecZ,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        buffer);

    // Step 9: Solve L^T * z = a (backward solve)
    cusparseSpSV_solve(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL,
        vecTemp_a,
        vecZ,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr);

    // Clean up: free the buffer and destroy the descriptor
    cudaFree(buffer);
    cusparseSpSV_destroyDescr(spSVDescr);
}
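For reference, cuSPARSE's SpSV requires the matrix descriptor to carry fill-mode and diagonal-type attributes before the analysis step. A minimal sketch of the kind of setup I mean (the CSR array names and dimensions are placeholders, not my actual code, and error checks are omitted):

    // Create a CSR descriptor for the n x n lower-triangular matrix L
    // (d_rowPtr, d_colInd, d_vals are hypothetical device pointers to the CSR arrays)
    cusparseSpMatDescr_t matL;
    cusparseCreateCsr(&matL, n, n, nnz, d_rowPtr, d_colInd, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);

    // Tell SpSV that the matrix is lower triangular with a non-unit diagonal
    cusparseFillMode_t fill = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diag = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_FILL_MODE, &fill, sizeof(fill));
    cusparseSpMatSetAttribute(matL, CUSPARSE_SPMAT_DIAG_TYPE, &diag, sizeof(diag));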

Does this happen completely independently of the data in the matrices themselves?


This occurs if two conditions are met: the matrix must have values outside of the main diagonal, not just on it (I am passing a lower triangular matrix into the function, in case that is relevant), and the matrix must have dimensions of 641x641 or greater. The data in the matrix itself does not seem to influence the problem. I created matrices with randomly generated values, matrices with fixed values, and matrices read from real-world FEA problems. Same issue in every case.

Generally speaking, the underlying problem could be

(1) an algorithmic error, i.e. the cuSparse functions are not being combined correctly to effect the solve

(2) a programming error, e.g. failed allocations or incorrect arguments passed to cuSparse functions

(3) a bug in cuSparse

I have not used cuSparse, and my experience with solvers is fairly minimal. To my knowledge, an INF result from a solver can mean there is no solution. For example, when one tries to allocate resources to activities to minimize cost, an INF result could indicate the cost is infinite, meaning there are no resources available for that activity. If your code delivers correct results for all matrices smaller than 641x641, and returns INF for all larger matrices regardless of data, it seems reasonable to conclude that the problem is not an algorithmic one.

The first check for programming errors would be to add error checking to all calls of cuSparse and CUDA functions. Are any of these calls failing with a status other than “success”? If that does not yield anything, run your program under the control of compute-sanitizer. Maybe there is an insufficiently sized allocation, or an access out of bounds. While compute-sanitizer cannot find all possible problems, it can find a good many of them. If possible, also do a peer review, where one focus could be validation of function argument lists. In my experience, when one deals with functions taking many arguments, it is easy to make a mistake, e.g. flipping two pointer arguments.

Bugs in NVIDIA libraries are possible, of course, but the older libraries like cuSparse can be considered mature, meaning the likelihood of bugs still lurking in them is small. Therefore one’s first efforts should go toward ferreting out bugs in one’s own code.

I use a CUDA error-check wrapper on all my CUDA function calls; I simply removed the wrappers here to make it easier for the reader to understand what the code does. Unfortunately, no error was returned.

I also used compute-sanitizer. It completes my code with 0 errors. I find this odd, as I would expect something in the CUDA libraries or the sanitizer to flag the INF value as an error.

I keep coming back to it being a bug I introduced in my own code as well, but then I can’t understand why it would only show up on matrices of a certain dimension or larger.

One could speculate. Maybe there are tiles / panels of 64x64 elements being used, and using more than 10 of these causes issues. That kind of bug could be in your code or in cuSparse. If one suspects the latter, one could try some quick speculative experiments, such as handing cuSparse a buffer that is larger than it claims it needs.

In my experience it is also possible to mischaracterize the triggers for an issue, leading the search for possible root causes down the wrong path. Unfortunately, 641x641 is pretty large for debugging purposes. With matrix functions I usually try to reproduce issues with minimal matrix sizes, say 3x3 or 4x4, to thoroughly inspect all intermediate data. Not sure what is possible here. If you have experience with such solves, you may be able to spot unexpected data patterns in the intermediate data before the INF is computed.

Cross post:

Is the rest of the result vector correct? Can you make the first value redundant or trivial by modifying the input equations?

I made a mistake, which a colleague pointed out to me. I misdiagnosed the problem: the issue lies with a CUDA kernel stopping its execution after the 640th thread, even though it should execute many thousands more. I will be opening a new topic, but thank you all for the help. I apologise for having wasted your time due to my own incompetence.