This showed a speedup of 2 times in some cases however, I kind of found a bug or some inconsistency in you provided Schur’s complement. If you take the following 4 steps for example:
/* Step 1: Forward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD, solverConfig, solverData, A, x, b),status, “cudssExecute for a partial forward solve up to the Schur complement”);
/* Step 2: Diagonal solve (necessary only if matrix type is symmetric or Hermitian, since cuDSS for these matrix types performs LDL^T(H) factorization) */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_DIAG, solverConfig, solverData, A, b, x), status,“cudssExecute for a partial diagonal solve (excluding the Schur complement)”);
/* Step 3: Solve the Schur complement system with an external solver */
int dense_solve_error = dense_matrix_solve(schur_shape[0], 1, schur_ld, schur_matrix_values_d, b_values_d + (n - schur_shape[0]), n, stream);
if (dense_solve_error != 0) {
printf(“Error: dense matrix solve failed\n”);
CUDSS_EXAMPLE_FREE;
return dense_solve_error;
}
/* Since in this example the Schur complement matrix is extracted as a dense matrix, one can use a dense solver from LAPACK (e.g. using cuSOLVER)*/
/* Step 4: Do the backward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM, solverConfig, solverData, A, x,b), status, “cudssExecute for a partial backward solve”);
Shouldn’t these bold parameters be x rather than b?
The second issue which I have is at time stepping I am running the the mentioned four steps. I change the RHS at each time step. Now the answers are incorrect. The general cuDSS doesn’t make this error. No matter how much you change the RHS it gives the correct solved values.
Initially I thought that it had to do with the permutation and I added the following before step 1. cudaMemcpy(b_values_d, b_values_d1, n * sizeof(double), cudaMemcpyDeviceToDevice);
Where b_values_d1 is the the non permuted vector. However, though by introducing this the values changed it weren’t the correct values. I must say this “sort of” rectifies the issue in the example case that you’ve provided in github. This messes up large matrices. I am working with a sparse matrix which is 1612x1612. Even for the smaller matrix that is provided in the example this is close to the right answer but not the right answer. Following are the changes that I made to the code of the example that you’ve provided in github
for (int i = 0; i < 5; i++) {
cudaMemcpy(b_values_d, b_values_d1, n * sizeof(double), cudaMemcpyDeviceToDevice);
/* As mentioned above, in case when the Schur complement is computed,
numerical factorization is done only partially. Therefore, one cannot
solve the full system with just one call to the solve phase.
Instead, one needs to perform the following steps:
1. Do the forward solve up to the Schur complement
2. Do the diagonal solve (necessary only if matrix type is symmetric or Hermitian,
since cuDSS for these matrix types performs LDL^T(H) factorization)
3. Solve the Schur complement system with an external solver
4. Do the backward solve up to the Schur complement
*/
/* Step 1: Forward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle, CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD,
solverConfig, solverData, A, x, b), status, "cudssExecute for a partial forward solve up to the Schur complement");
/* Step 2: Diagonal solve (necessary only if matrix type is symmetric or Hermitian,
since cuDSS for these matrix types performs LDL^T(H) factorization) */
CUDSS_CALL_AND_CHECK(
cudssExecute(handle, CUDSS_PHASE_SOLVE_DIAG, solverConfig, solverData, A, x, x),
status, "cudssExecute for a partial diagonal solve (excluding the Schur complement)");
/* Step 3: Solve the Schur complement system with an external solver */
int dense_solve_error =
dense_matrix_solve(schur_shape[0], nrhs, schur_ld, schur_matrix_values_d,
x_values_d + (n - schur_shape[0]), n, stream);
if (dense_solve_error != 0) {
printf("Error: dense matrix solve failed\n");
CUDSS_EXAMPLE_FREE;
return dense_solve_error;
}
/* Since in this example the Schur complement matrix is extracted as a dense matrix,
one can use a dense solver from LAPACK (e.g. using cuSOLVER)*/
/* Step 4: Do the backward solve up to the Schur complement */
CUDSS_CALL_AND_CHECK(cudssExecute(handle,
CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM, solverConfig, solverData, A, x, x),
status, "cudssExecute for a partial backward solve");
/* Print the solution and compare against the exact solution */
CUDA_CALL_AND_CHECK(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
CUDA_CALL_AND_CHECK(cudaMemcpy(x_values_h, x_values_d, nrhs * n * sizeof(double),
cudaMemcpyDeviceToHost), "cudaMemcpy for x_values");
CUDA_CALL_AND_CHECK(cudaMemcpy(b_values_h1, b_values_d, nrhs * n * sizeof(double),
cudaMemcpyDeviceToHost), "cudaMemcpy for b_values");
printf("Iteration number = %d \n", i);
for (int a = 0; a < n; a++) {
printf("x[%d] = %1.4f b[%d] = %1.4f\n", a, x_values_h[a], a, b_values_h1[a]);
}
}
And the answers for the 5 steps are:
CUDSS Version (Major,Minor,PatchLevel): 0.7.1
schur shape: nrows = 2, ncols = 2, sparse nnz (for a general matrix) = 4
schur_matrix_values[0][0] = 2.1579
schur_matrix_values[0][1] = -0.4211
schur_matrix_values[1][0] = -0.4211
schur_matrix_values[1][1] = 1.7895
Iteration number = 0
x[0] = 1.0000 b[0] = 7.0000
x[1] = 2.0000 b[1] = 12.0000
x[2] = 3.0000 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 5.0000 b[4] = 13.0000
Iteration number = 1
x[0] = 0.9982 b[0] = 7.0000
x[1] = 1.9952 b[1] = 12.0000
x[2] = 3.0072 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9754 b[4] = 13.0000
Iteration number = 2
x[0] = 0.9973 b[0] = 7.0000
x[1] = 1.9929 b[1] = 12.0000
x[2] = 3.0107 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9636 b[4] = 13.0000
Iteration number = 3
x[0] = 0.9969 b[0] = 7.0000
x[1] = 1.9918 b[1] = 12.0000
x[2] = 3.0123 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9581 b[4] = 13.0000
Iteration number = 4
x[0] = 0.9967 b[0] = 7.0000
x[1] = 1.9913 b[1] = 12.0000
x[2] = 3.0130 b[2] = 25.0000
x[3] = 4.0000 b[3] = 4.0000
x[4] = 4.9555 b[4] = 13.0000
Can I know if it is bug or not?
Thank you for the corporation.