cuDSS is sometimes wrong where cuSPARSE and UMFPACK succeed

We are currently trying to integrate cuDSS into one of our projects, but we find that it sometimes returns (spectacularly) wrong results for some systems.

Here is a reproducer that can be dropped into the cuDSS simple example from the CUDALibrarySamples repository:

int i = 0;
int n = 17;
int nnz = 32;
int rhs = 1;

// ... mallocs ...

int csr_indptr_raw[] = {
    0, 1, 3, 5, 9, 10, 11, 14, 15, 17, 19, 20, 23, 25, 29, 30, 31, 32
};

for (i = 0; i < n+1; i++) {
    csr_offsets_h[i] = csr_indptr_raw[i];
}

int csr_indices_raw[] = {
    2, 2, 9, 0, 1, 0, 3, 13, 14, 5, 4, 6, 7, 11, 6, 9, 10, 1, 8, 8, 6, 11, 12, 11, 13, 0, 3, 12, 13, 3, 16, 15
};

for (i = 0; i < nnz; i++) {
    csr_columns_h[i] = csr_indices_raw[i];
}

double csr_values_raw[] = {
    -1000000000000.0, 1000000000000.0, -1000000000000.0, -1000000.0, 1000000.0, -4.26122904e-12, -10379765.7, 10379765.7, -1000000000000.0, 1000000000000.0, 1000000.0, -1000000000.0, 1000000000000.0, 1000000000.0, 1000000.0, 1000000000000.0, 1000000000000.0, -1000000.0, 1000000.0, 1000000.0, 1000000000.0, -1000000000.0, 1000000000000.0, 1000000.0, -1000000.0, 4.26122904e-12, 10379765.7, -1000000000000.0, -10379765.7, -1000000.0, 1000000000.0, 10000.0
};

for (i = 0; i < nnz; i++) {
    csr_values_h[i] = csr_values_raw[i];
}

// ...

cudssMatrixType_t mtype     = CUDSS_MTYPE_GENERAL;
cudssMatrixViewType_t mview = CUDSS_MVIEW_FULL;
cudssIndexBase_t base       = CUDSS_BASE_ZERO;

Anyway, if you try to solve the system above, cuDSS gives an answer that is totally wrong:

x[0] = 251979373537734098944.0000
x[1] = -0.4000
x[2] = -0.0000
x[3] = -0.0000
x[4] = 0.0000
x[5] = 0.0000
x[6] = -0.8001
x[7] = 576286421369295.5000
x[8] = -0.4000
x[9] = -0.0000
x[10] = 0.0000
x[11] = -576286421369238528.0000
x[12] = -91095040000000000000.0000
x[13] = -576286421369238528.0000
x[14] = -5981718029904.1699
x[15] = 0.0000
x[16] = 0.0000

Solving the same system using a dense solver (from numpy) gives

[-4.00000000e-01 -4.00000000e-01 -0.00000000e+00 -1.33671954e-16
  0.00000000e+00  0.00000000e+00 -8.00000000e-01 -8.21850639e-06
 -4.00000000e-01 -0.00000000e+00  0.00000000e+00 -7.91781494e-01
  8.21850639e-06 -7.91781494e-01 -8.21850639e-06  0.00000000e+00
  0.00000000e+00]

Using UMFPACK (I hope) via SciPy, with scikit-umfpack and SuiteSparse installed:

[-4.00000000e-01 -4.00000000e-01 -0.00000000e+00 -0.00000000e+00
  0.00000000e+00  0.00000000e+00 -8.00000000e-01 -8.21850639e-06
 -4.00000000e-01 -0.00000000e+00  0.00000000e+00 -7.91781494e-01
  8.21850639e-06 -7.91781494e-01 -8.21850639e-06  0.00000000e+00
  0.00000000e+00]

Using cuSPARSE via jax.experimental.sparse.linalg.spsolve:

[-4.00000000e-01 -4.00000000e-01  1.74622983e-23 -1.03690234e-13
 -0.00000000e+00  0.00000000e+00 -8.00000000e-01 -8.21850639e-06
 -4.00000000e-01  5.82076609e-24  5.82076609e-24 -7.91781494e-01
  8.21850639e-06 -7.91781494e-01 -8.21850639e-06 -0.00000000e+00
  0.00000000e+00]

I can reproduce these results on two separate machines, with an RTX 4090 and an RTX 4070 Super respectively. I am using cuDSS 0.6.

I tried setting CUDSS_CONFIG_REORDERING_ALG to CUDSS_ALG_1 and the results are much better:

x[0] = -0.4000000000
x[1] = -0.4000000000
x[2] = -0.0000000000
x[3] = -0.0000000000
x[4] = 0.0000000000
x[5] = 0.0000000000
x[6] = -0.8000000000
x[7] = -0.0000082185
x[8] = -0.4000000000
x[9] = -0.0000000000
x[10] = 0.0000000000
x[11] = -0.7917814936
x[12] = 0.0000082185
x[13] = -0.7917814936
x[14] = -0.0000082185
x[15] = 0.0000000000
x[16] = 0.0000000000

I guess if the answer is wrong just fiddle with the parameters until the result is correct?

Hi @otoomey

Since, as you checked, other solvers can handle the matrix, I assume the indexing data is correct, and I am fairly sure (though I haven't checked) that the reason for the bad accuracy is the presence of perturbed pivots.

By default, cuDSS uses a pivoting epsilon of 1.0e-5 for single precision and 1.0e-13 for double precision: if a diagonal value ends up smaller in magnitude than this threshold after the partial-pivoting permutation has been applied, it is considered too small and is replaced by the pivoting epsilon.

In fact, a similar default is used in some other solvers. On the other hand, a case like yours should not make the solver fail, so we'll consider changing our default settings.

For your example, with its huge scale difference between matrix values, this perturbation can obviously lead to a very wrong result.

There are standard ways to mitigate this:

  • turn off the pivot perturbation (e.g. by setting CUDSS_CONFIG_PIVOT_EPSILON to 0.0)
  • enable matching & scaling (by setting CUDSS_CONFIG_USE_MATCHING to 1)
  • use iterative refinement (by default cuDSS does zero refinement steps); this likely won't help in your case, since the scale difference is just too large
  • use custom matrix pre-processing (e.g. scaling the matrix yourself)
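For reference, here is roughly how the first two options are set through the configuration API. This is a hedged sketch, not copy-paste-ready code: I'm assuming a cudssConfig_t named solverConfig created with cudssConfigCreate() as in the simple example, and the cudssConfigSet(config, param, &value, sizeof(value)) shape from the cuDSS documentation; error checking is omitted.

```c
/* Option 1: turn off the pivot perturbation entirely. */
double pivot_eps = 0.0;
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_EPSILON,
               &pivot_eps, sizeof(pivot_eps));

/* Option 2: enable matching & scaling instead. */
int use_matching = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_USE_MATCHING,
               &use_matching, sizeof(use_matching));

/* The reordering algorithm is set the same way. */
cudssAlgType_t reorder_alg = CUDSS_ALG_1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_REORDERING_ALG,
               &reorder_alg, sizeof(reorder_alg));
```

All of these must be set before the analysis/factorization phases for them to take effect.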

By switching to CUDSS_ALG_1 you likely avoided perturbed pivots altogether, because that algorithm uses global pivoting rather than partial (local) pivoting.

You can confirm (or refute) what I said about perturbed pivots by querying their count: call cudssDataGet() with CUDSS_DATA_NPIVOTS.
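A sketch of that query, assuming the handle and solverData objects from the simple example (the argument order follows my reading of the cuDSS docs; double-check against your version):

```c
/* Query the number of perturbed pivots after factorization. */
int npivots = 0;
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_NPIVOTS,
             &npivots, sizeof(npivots), &size_written);
printf("perturbed pivots: %d\n", npivots); /* 0 means no perturbation occurred */
```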

Thanks,
Kirill

Hello @kvoronin

Thanks for the advice. I've spent the last few days experimenting with different configurations. It turns out the large values in the matrix were caused by a bug in our RHS prescaler, which we have since disabled. To get a better idea of performance and accuracy, I've also switched to a larger problem size, about 3900x3900. We are using Newton-Raphson iterations in our simulator, so I suppose we get some amount of iterative refinement for free, but I've noticed that the default configuration (with the default reordering algorithm) consistently requires more iterations. My current hypothesis is that this is due to decreased accuracy. Our benchmark produces the following results for individual solver performance:

Matrix statistics:
NxN:  3849
NNZ:  534285
Sparsity: 3.61%
True NNZ:  533773
Maximum value:  2.011931016118471
Minimum value:  -24.894767305521764
====================================
Full linear solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:13<00:00,  1.82it/s]
Full linear evaluation time: 549.35 ms ± 5.28 ms, min 541.89 ms
MSE: min=0.0,  max=0.0,  mean=0.0
====================================
UMFPACK CPU sparse solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 29.21it/s]
UMFPACK evaluation time: 29.83 ms ± 0.34 ms, min 29.46 ms
MSE: min=1.39e-29,  max=1.39e-29,  mean=1.39e-29
====================================
Jax cuSparse sparse solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:02<00:00, 12.00it/s]
Jax cuSparse evaluation time: 79.99 ms ± 0.73 ms, min 78.52 ms
MSE: min=3.79e-26,  max=3.79e-26,  mean=3.79e-26
====================================
Jax cuDSS sparse solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 13.09it/s]
Jax cuDSS evaluation time: 74.31 ms ± 0.38 ms, min 73.31 ms
MSE: min=0.000657,  max=0.000663,  mean=0.00066
====================================
Jax full linear solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 18.27it/s]
Jax full linear evaluation time: 51.60 ms ± 0.87 ms, min 50.41 ms
MSE: min=3.36e-27,  max=3.36e-27,  mean=3.36e-27

The number of perturbed pivots reported via CUDSS_DATA_NPIVOTS is 1374. I am not familiar enough with solvers to say whether that is many or few; perhaps you have some insight.

Config A: Different reordering:

CUDSS_CONFIG_REORDERING_ALG=CUDSS_ALG_1

I get:

Jax cuDSS sparse solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 23.70it/s]
Jax cuDSS evaluation time: 40.92 ms ± 0.89 ms, min 39.80 ms
MSE: min=1.31e-27,  max=1.31e-27,  mean=1.31e-27

This is both faster and more accurate than the default METIS-based algorithm.

Config B: Next, I tried disabling pivoting:

CUDSS_CONFIG_REORDERING_ALG=CUDSS_ALG_DEFAULT
CUDSS_CONFIG_PIVOT_EPSILON=0.0

Unfortunately the solve failed: the result was all NaNs.

Config C: Enable matching and scaling:

CUDSS_CONFIG_REORDERING_ALG=CUDSS_ALG_DEFAULT
CUDSS_CONFIG_USE_MATCHING=1

This succeeded:

Jax cuDSS sparse solver for 3849x3849 system.
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:01<00:00, 12.64it/s]
Jax cuDSS evaluation time: 76.87 ms ± 0.99 ms, min 75.10 ms
MSE: min=2.16e-28,  max=7.13e-28,  mean=4.21e-28

In this case we get 0 pivots.

Here is a profiler plot from the final configuration (Config C):

And here is the plot when using Config A:

The small red blocks are MemcpyD2H operations, for the analysis phase I assume. We are using JAX, so the trace is unfortunately not terribly informative. Looking at the kernel stats, I see 17.74 ms for cudss::main_loop_ker in Config A and 19.96 ms for cudss::map_ker in Config C.

Is this roughly the performance we can expect from this solver?

I’ve attached the matrix arrays below in case someone is interested in trying for themselves:

rhs.gz (4.2 KB)

A_values.gz (5.3 MB)

A_indptr.gz (10.5 KB)

A_indices.gz (160.4 KB)