cusparseSpSV_solve function extremely slow

I have implemented a PCG method with CUDA. After timing it, I found that 99% of the runtime is spent in a function built around the cusparseSpSV_solve function set, which performs a forward-backward solve. As a result, my preconditioned method takes 16 times longer than the non-preconditioned version, regardless of which preconditioner I use. The results themselves are correct; the problem is the time taken to perform a single forward-backward solve, so it is not a data-formatting issue.

I am using matrices with sparsity of 99.7% or greater, converted to CSR format before being passed into the function. Can someone tell me why this particular cuSPARSE function is so slow? Have I implemented it poorly? Is there an alternative function set?

// This function performs a forward-backward solve with a lower triangular matrix using the SpSV function set.
void triangular_solve(cusparseHandle_t cusparseHandle, cusparseConstSpMatDescr_t matL, cusparseDnVecDescr_t vecZ, cusparseDnVecDescr_t vecRes, cusparseDnVecDescr_t vecTemp_a) {
    size_t bufferSize = 0; // Declare variables for buffer size and buffer allocation
    void* buffer = nullptr;
    double alpha = 1.0; // Scaling factor (1.0 = no scaling)

    // Forward solve: L * a = r
    // Step 1: Create SpSV descriptor
    cusparseSpSVDescr_t spSVDescr = NULL, spSVDescr2 = NULL;
    cusparseSpSV_createDescr(&spSVDescr);
    cusparseSpSV_createDescr(&spSVDescr2);

    // Step 2: Query buffer size for forward solve (L * a = r)
    cusparseSpSV_bufferSize(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,  // opA
        &alpha,
        matL, // matrix A
        vecRes, // vector r
        vecTemp_a, // vector a (output)
        CUDA_R_64F,  // Use the appropriate data type (double precision in this case)
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        &bufferSize);

    // Step 3: Allocate memory for buffer
    cudaMalloc(&buffer, bufferSize);

    // Step 4: Perform analysis for forward solve
    cusparseSpSV_analysis(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha,
        matL,
        vecRes,
        vecTemp_a,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr,
        buffer);

    // Step 5: Solve L * a = r (forward solve)
    cusparseSpSV_solve(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha,
        matL,
        vecRes,
        vecTemp_a,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr);

    // Backward solve: L^T * z = a
    // Step 6: Query buffer size for backward solve (L^T * z = a)
    cusparseSpSV_bufferSize(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL, // matrix A
        vecTemp_a,  // vector a (input)
        vecZ, // vector z (output)
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr2,
        &bufferSize);

    // Step 7: Free the forward-solve buffer and allocate a new one sized for the backward solve
    cudaFree(buffer);
    cudaMalloc(&buffer, bufferSize);

    // Step 8: Perform analysis for backward solve
    cusparseSpSV_analysis(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL,
        vecTemp_a,
        vecZ,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr2,
        buffer);

    // Step 9: Solve L^T * z = a (backward solve)
    cusparseSpSV_solve(cusparseHandle,
        CUSPARSE_OPERATION_TRANSPOSE,
        &alpha,
        matL,
        vecTemp_a,
        vecZ,
        CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT,
        spSVDescr2);

    // Clean up: free buffer and destroy the descriptor
    cudaFree(buffer);
    cusparseSpSV_destroyDescr(spSVDescr);
    cusparseSpSV_destroyDescr(spSVDescr2);
}

Hi @SBenner,
The triangular_solve function is slow because cusparseSpSV_bufferSize, cudaMalloc, and in particular cusparseSpSV_analysis are all re-run on every preconditioner application. The analysis step is expensive and only needs to be performed once per matrix. By moving it into a preprocessing phase and reusing the resulting descriptors, each preconditioner application reduces to the two cusparseSpSV_solve calls, which should cut the runtime significantly.

The right approach involves two main parts:

  1. Preprocessing: Run cusparseSpSV_bufferSize and cusparseSpSV_analysis once per triangular factor, and store the resulting SpSV descriptors and buffers.
  2. Solve phase: In the actual precondition step, call cusparseSpSV_solve with the precomputed descriptors. This avoids recomputing the analysis data and speeds up each solve.
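In outline, the split looks like the sketch below. This is not taken from the HPCG source; the SpSVContext struct and the spsv_setup/spsv_apply names are illustrative, and error checking is omitted for brevity.

```cpp
#include <cusparse.h>
#include <cuda_runtime.h>

// Hypothetical container for the precomputed SpSV state.
struct SpSVContext {
    cusparseSpSVDescr_t descrFwd;  // forward solve:  L   * a = r
    cusparseSpSVDescr_t descrBwd;  // backward solve: L^T * z = a
    void* bufFwd;
    void* bufBwd;
};

// Preprocessing: run ONCE, before the PCG iteration loop.
void spsv_setup(cusparseHandle_t h, cusparseConstSpMatDescr_t matL,
                cusparseDnVecDescr_t vecRes, cusparseDnVecDescr_t vecTemp_a,
                cusparseDnVecDescr_t vecZ, SpSVContext* ctx) {
    double alpha = 1.0;
    size_t szFwd = 0, szBwd = 0;

    // Forward solve: query buffer, allocate, analyze.
    cusparseSpSV_createDescr(&ctx->descrFwd);
    cusparseSpSV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
        matL, vecRes, vecTemp_a, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrFwd, &szFwd);
    cudaMalloc(&ctx->bufFwd, szFwd);
    cusparseSpSV_analysis(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
        matL, vecRes, vecTemp_a, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrFwd, ctx->bufFwd);

    // Backward solve: separate descriptor and buffer for the transpose op.
    cusparseSpSV_createDescr(&ctx->descrBwd);
    cusparseSpSV_bufferSize(h, CUSPARSE_OPERATION_TRANSPOSE, &alpha,
        matL, vecTemp_a, vecZ, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrBwd, &szBwd);
    cudaMalloc(&ctx->bufBwd, szBwd);
    cusparseSpSV_analysis(h, CUSPARSE_OPERATION_TRANSPOSE, &alpha,
        matL, vecTemp_a, vecZ, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrBwd, ctx->bufBwd);
}

// Solve phase: called every iteration; no analysis, no allocation.
void spsv_apply(cusparseHandle_t h, cusparseConstSpMatDescr_t matL,
                cusparseDnVecDescr_t vecRes, cusparseDnVecDescr_t vecTemp_a,
                cusparseDnVecDescr_t vecZ, SpSVContext* ctx) {
    double alpha = 1.0;
    cusparseSpSV_solve(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
        matL, vecRes, vecTemp_a, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrFwd);
    cusparseSpSV_solve(h, CUSPARSE_OPERATION_TRANSPOSE, &alpha,
        matL, vecTemp_a, vecZ, CUDA_R_64F,
        CUSPARSE_SPSV_ALG_DEFAULT, ctx->descrBwd);
}
```

The descriptors, buffers, and dense-vector handles must outlive the iteration loop, and freeing them (cusparseSpSV_destroyDescr, cudaFree) moves to a teardown step after PCG converges.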

The NVIDIA HPCG benchmark illustrates this best practice:

  • In the OptimizeProblem.cpp file, the lower and upper matrix descriptors are created once in the optimization phase, which avoids redundant processing (link).
  • The ComputeSYMGS.cpp file then reuses these descriptors in each call to the preconditioner (link).

This refactoring should make your triangular_solve function significantly faster and more efficient.

For more info please refer to the cusparseSpSV documentation (link).

Thanks,

Thanks for the suggestion, I implemented that change. While I can refactor my code to run the buffer-size calculations only once, if I also do the analysis only once before beginning iteration, it fails to converge. Is that to be expected? I can’t find a real difference between my new code and yours in what happens between cusparseSpSV_solve calls, other than a call to SpmvDiagCuda, but I’m not sure what that is and I can’t find a definition. Is there something I need to consider when moving the analysis call out of the function I listed above?

I see now that in the example you linked you are using CUSPARSE_OPERATION_NON_TRANSPOSE for both of your operations, but I need the transpose version for the second operation. Would I be correct in assuming that is why I can’t run the analysis function just once?

Thank you for your suggestions, and your code examples. With them, I was able to track down my error. I had to create two descriptors, one for the forward solve and one for the backward solve, since one uses the non-transpose operation and the other the transpose.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.