Partial factored matrix in cuDSS

Hi,

I need to solve Ax=b. Matrix A is a sparse symmetric positive definite (SPD) matrix. Therefore, I am using Cholesky factorization. Since I am using this factorization in a time stepping loop there are instances that the A matrix changes.

Now the non zero pattern doesn’t change at time stepping and therefore, i do not need to do the symbolic factorization within the loop. However, when the changes occur i have to do the numeric factorization. Testing this approach with many available GPU and CPU packages I found cuDSS Cholesky factorization is the fastest. However, my question is whether it is possible to retain a part of the L factored matrix. So lets say if at time stepping only the non zero elements of last 10 rows of the matrix changes (not to zero but to other value keeping the SPD structure) of a 10000 by 10000 matrix, can we retain the previously calculated upper portion of the L matrix intact. Thereby reducing the time at factorization more.

Hi @aluthged!

We’re glad to hear that you are successfully using cuDSS. What you’re asking for is a known feature in general for sparse direct solvers. However, it hasn’t been implemented in cuDSS yet.

If you coudl tell us more about your application (here or via our contact email), it would help us prioritize adding this feature.

Right now, what you can do is implement a block 2x2 approach,

A B

B^T C

where the last block C is corresponding to the rows/cols where the matrix is changing. Then, you can use cuDSS to factorize the non-changing square part = top left block (for this you will need to have a separate CSR matrix for block A), and then basically compute the Schur complement of it, solve the tiny system for the Schur complement(10x10 in your example), and then find the rest of the solution vector using a couple of matrix-vector products and cuDSS solve.

Then, for the new iteration (time step) where C changes, you only need to update this small block.

Also, if the matrix changes are confined to just C (and do not involve B and B^T), then you can uyse the Schur complement feature of cuDSS. Then you don’t need to have a separate CSR for A, you just compute the Schur complement C - B^T A^-1 B using cuDSS and then do the same as I described above.

Let me know if this idea makes sense to you, or you want to discuss further details.

Thanks,
Kirill

Hi kvoronin,

Firstly thanks for the prompt reply.

This makes sense. Usually the part you refer as B too changes, but we can localize it to C with some “tricks”. So as per your suggestion if we use Schur’s complement instead of the numerical factorization, there is a possibility to make the program faster? Let me try it and will get back to you with the results.

We are working on Electromagnetic Transient Simulations (EMT). A electrical power system analysis tools such as PSCAD, RTDS, EMTP, OpalRT etc. solves Ax=b in their engines. However, due to the increase of all these renewable integrations (solar and wind) we have a hard time (takes a long time) to solved matrix equation. It is because historically the matrix A was a constant or at least the major part of the matrix. But now due to the integration of solar and wind that doesn’t stay a constant at each time step. So we planned to develop a fully GPU based EMT solver.

Initially, we tried to communicate to the GPU at each time step and just to do the solution. But it failed due to communication overheads. But now we are minimizing communication with the CPU and modelling part of the network on the GPU, but still we feel like it would be better if we could use the factored part so that we can make the simulation even faster. This is the reason why partial factorization is needed.

Thank you.

With regards,
Devin

Hi Devin!

Thanks for sharing the context!

About my idea from the previous message: I think it should help in your case, because then you can factorize the majority of the matrix once and the cost of re-doing the factorization should be drastically reduced.

Let us know if you have any issues when trying this approach.

With regards,
Kirill

This showed a speedup of 2 times in some cases however, I kind of found a bug or some inconsistency in you provided Schur’s complement. If you take the following 4 steps for example:

/* Step 1: Forward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD, solverConfig, solverData, A, x, b),status, “cudssExecute for a partial forward solve up to the Schur complement”);

/* Step 2: Diagonal solve (necessary only if matrix type is symmetric or Hermitian, since cuDSS for these matrix types performs LDL^T(H) factorization) */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_DIAG, solverConfig, solverData, A, b, x), status,“cudssExecute for a partial diagonal solve (excluding the Schur complement)”);

/* Step 3: Solve the Schur complement system with an external solver */
int dense_solve_error = dense_matrix_solve(schur_shape[0], 1, schur_ld, schur_matrix_values_d, b_values_d + (n - schur_shape[0]), n, stream);
if (dense_solve_error != 0) {
printf(“Error: dense matrix solve failed\n”);
CUDSS_EXAMPLE_FREE;
return dense_solve_error;
}

/* Since in this example the Schur complement matrix is extracted as a dense matrix, one can use a dense solver from LAPACK (e.g. using cuSOLVER)*/

/* Step 4: Do the backward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM, solverConfig, solverData, A, x,b), status, “cudssExecute for a partial backward solve”);

Shouldn’t these bold parameters be x rather than b?

The second issue which I have is at time stepping I am running the the mentioned four steps. I change the RHS at each time step. Now the answers are incorrect. The general cuDSS doesn’t make this error. No matter how much you change the RHS it gives the correct solved values.

Initially I thought that it had to do with the permutation and I added the following before step 1. cudaMemcpy(b_values_d, b_values_d1, n * sizeof(double), cudaMemcpyDeviceToDevice);

Where b_values_d1 is the the non permuted vector. However, though by introducing this the values changed it weren’t the correct values. I must say this “sort of” rectifies the issue in the example case that you’ve provided in github. This messes up large matrices. I am working with a sparse matrix which is 1612x1612. Even for the smaller matrix that is provided in the example this is close to the right answer but not the right answer. Following are the changes that I made to the code of the example that you’ve provided in github

for (int i = 0; i < 5; i++) {
    cudaMemcpy(b_values_d, b_values_d1, n * sizeof(double), cudaMemcpyDeviceToDevice);
    /* As mentioned above, in case when the Schur complement is computed,
       numerical factorization is done only partially. Therefore, one cannot
       solve the full system with just one call to the solve phase.
       Instead, one needs to perform the following steps:
       1. Do the forward solve up to the Schur complement
       2. Do the diagonal solve (necessary only if matrix type is symmetric or Hermitian,
          since cuDSS for these matrix types performs LDL^T(H) factorization)
       3. Solve the Schur complement system with an external solver
       4. Do the backward solve up to the Schur complement
     */
     /* Step 1: Forward solve up to the Schur complement */
    CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD,
            solverConfig, solverData, A, x, b), status, "cudssExecute for a partial forward solve up to the Schur complement");

    /* Step 2: Diagonal solve (necessary only if matrix type is symmetric or Hermitian,
       since cuDSS for these matrix types performs LDL^T(H) factorization) */
    CUDSS_CALL_AND_CHECK(
        cudssExecute(handle, CUDSS_PHASE_SOLVE_DIAG, solverConfig, solverData, A, x, x),
        status, "cudssExecute for a partial diagonal solve (excluding the Schur complement)");

    /* Step 3: Solve the Schur complement system with an external solver */
    int dense_solve_error =
        dense_matrix_solve(schur_shape[0], nrhs, schur_ld, schur_matrix_values_d,
           x_values_d + (n - schur_shape[0]), n, stream);
    if (dense_solve_error != 0) {
        printf("Error: dense matrix solve failed\n");
        CUDSS_EXAMPLE_FREE;
        return dense_solve_error;
    }

    /* Since in this example the Schur complement matrix is extracted as a dense matrix,
       one can use a dense solver from LAPACK (e.g. using cuSOLVER)*/

       /* Step 4: Do the backward solve up to the Schur complement */
    CUDSS_CALL_AND_CHECK(cudssExecute(handle,
        CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM, solverConfig, solverData, A, x, x),
        status, "cudssExecute for a partial backward solve");

    /* Print the solution and compare against the exact solution */
    CUDA_CALL_AND_CHECK(cudaStreamSynchronize(stream), "cudaStreamSynchronize");

    CUDA_CALL_AND_CHECK(cudaMemcpy(x_values_h, x_values_d, nrhs * n * sizeof(double),
        cudaMemcpyDeviceToHost), "cudaMemcpy for x_values");
    CUDA_CALL_AND_CHECK(cudaMemcpy(b_values_h1, b_values_d, nrhs * n * sizeof(double),
        cudaMemcpyDeviceToHost), "cudaMemcpy for b_values");

    printf("Iteration number = %d \n", i);
    for (int a = 0; a < n; a++) {
        printf("x[%d] = %1.4f  b[%d] = %1.4f\n", a, x_values_h[a], a, b_values_h1[a]);
    }
}

And the answers for the 5 steps are:

CUDSS Version (Major,Minor,PatchLevel): 0.7.1
schur shape: nrows = 2, ncols = 2, sparse nnz (for a general matrix) = 4
schur_matrix_values[0][0] = 2.1579
schur_matrix_values[0][1] = -0.4211
schur_matrix_values[1][0] = -0.4211
schur_matrix_values[1][1] = 1.7895
Iteration number = 0
x[0] = 1.0000 b[0] = 7.0000
x[1] = 2.0000 b[1] = 12.0000
x[2] = 3.0000 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 5.0000 b[4] = 13.0000
Iteration number = 1
x[0] = 0.9982 b[0] = 7.0000
x[1] = 1.9952 b[1] = 12.0000
x[2] = 3.0072 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9754 b[4] = 13.0000
Iteration number = 2
x[0] = 0.9973 b[0] = 7.0000
x[1] = 1.9929 b[1] = 12.0000
x[2] = 3.0107 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9636 b[4] = 13.0000
Iteration number = 3
x[0] = 0.9969 b[0] = 7.0000
x[1] = 1.9918 b[1] = 12.0000
x[2] = 3.0123 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9581 b[4] = 13.0000
Iteration number = 4
x[0] = 0.9967 b[0] = 7.0000
x[1] = 1.9913 b[1] = 12.0000
x[2] = 3.0130 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9555 b[4] = 13.0000

Can I know if it is bug or not?

Thank you for the corporation.