Wrong result with cuDSS

Dear All,

I’ve tested cuDSS v0.5 on an A40 with a 5x5 sparse matrix containing very small values (on the order of 1e-115), but I got a wrong result. Please take a look at the input matrix, RHS vector, and result below. I built the exact CSR format and set the matrix type to GENERAL and the view type to FULL accordingly. Is there any way to handle this?

Matrix (row, column, value):
1 1 -5.9211058216815069e-115
1 3 -1.5307239967446657e-22
2 2 1.1851126493134462e-114
2 3 3.0637526600302602e-22
3 3 -2.9650103357264466e-115
3 5 -7.6651433164278924e-23
4 4 -5.9300206714528959e-115
5 5 -1.5330286632855794e-22

RHS vector:
7.0
12.0
25.0
4.0
13.0

cuDSS result:
249999999800706.25
120000000428925.38
-70000000000000
-40000000000000
-130000000000000

Expected result:
1.6130193769216447e+208
1.6130193769216447e+208
-6.2394386208665224e+115
-6.745339049585425e+114
-8.4799458166284143e+22

Thanks in advance.

Hi!

Thanks for the question. I’ve checked, and cuDSS did exactly what it was supposed to do with the given settings.

To understand the cuDSS behavior here, you need to understand how pivoting works.

With the default configuration settings, cuDSS does partial pivoting with a default pivoting epsilon (see cuDSS Data Types — NVIDIA cuDSS), which for double precision and the GENERAL matrix type is 1e-13. When the diagonal blocks of the matrix are factorized, cuDSS (due to the default value CUDSS_PIVOT_COL) searches for the pivot element within the partial column and then, after the swap, checks whether the pivot element is smaller in magnitude than the pivoting epsilon (so essentially, the pivoting epsilon tells cuDSS how small a value must be to be treated as a numerical zero). If the value is too small, it gets replaced with the pivoting epsilon. Such pivots are called perturbed pivots. This is done to limit numerical instability.
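To make the check concrete, here is a tiny sketch of the idea (this is not cuDSS source code, and the sign handling is my simplification):

```c
#include <math.h>

/* Sketch only, NOT cuDSS source: a pivot whose magnitude falls below the
 * pivoting epsilon is treated as a numerical zero, replaced by +/- epsilon
 * (sign handling here is a simplification), and counted as "perturbed". */
static double apply_pivot_epsilon(double pivot, double pivot_epsilon,
                                  int *n_perturbed)
{
    if (fabs(pivot) < pivot_epsilon) {
        pivot = (pivot >= 0.0) ? pivot_epsilon : -pivot_epsilon;
        (*n_perturbed)++;
    }
    return pivot;
}
```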

You can check via cudssDataGet() with CUDSS_DATA_NPIVOTS, after the factorization phase, that your matrix has 5 perturbed pivots (which makes sense: all values are tiny compared to the default pivoting epsilon).
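A minimal sketch of that query (error checks omitted; handle and solverData are assumed to be the cudssHandle_t and cudssData_t objects from your existing solver code):

```c
/* After the factorization phase, i.e. after
 * cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, ...): */
int npivots = 0;
size_t sizeWritten = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_NPIVOTS,
             &npivots, sizeof(npivots), &sizeWritten);
printf("perturbed pivots: %d\n", npivots); /* should report 5 for your matrix */
```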

Usually, when perturbed pivots are present, iterative refinement may be used to restore the accuracy (if it works). By default, however, cuDSS does no iterative refinement steps (this can be changed via CUDSS_CONFIG_IR_N_STEPS).
In your case, though, since the pivoting epsilon is so large compared to the matrix elements, iterative refinement will not help.
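For reference, enabling it is a single config call before the solve phase (sketch; solverConfig is the cudssConfig_t from your existing code):

```c
/* Request 2 iterative refinement steps (the default is 0). */
int ir_n_steps = 2;
cudssConfigSet(solverConfig, CUDSS_CONFIG_IR_N_STEPS,
               &ir_n_steps, sizeof(ir_n_steps));
```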

To change the behavior, you have the following options:
a) [recommended right now] Set a small value for CUDSS_CONFIG_PIVOT_EPSILON via cudssConfigSet (to practically disable the checks; something like 1e-300 would do). See the sketch after this list.
b) [this will not work for your specific case, as the scales vary too wildly] Set a non-default pivot epsilon scaling algorithm via cudssConfigSet, with CUDSS_ALG_1 for CUDSS_CONFIG_PIVOT_EPSILON_ALG, and enable iterative refinement via CUDSS_CONFIG_IR_N_STEPS with, say, 2 iterative refinement steps.
c) [currently will not work as described and thus does not help, as it only disables swaps but not the epsilon checks] Disable pivoting altogether by setting CUDSS_CONFIG_PIVOT_TYPE to CUDSS_PIVOT_NONE.
d) Starting with one of the next releases, you will be able to enable matching + scaling (and scaling here is the most relevant part). When scaling is done, the default pivoting epsilon should work well if the matrix is not too ill-conditioned.
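For example, option (a) and the config part of option (b) would look roughly like this (sketch; solverConfig is the cudssConfig_t from your existing code, the calls go before the factorization phase, and error checks are omitted):

```c
/* Option (a): set the pivoting epsilon far below your matrix entries,
 * which practically disables the numerical-zero check. */
double pivot_epsilon = 1e-300;
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_EPSILON,
               &pivot_epsilon, sizeof(pivot_epsilon));

/* Option (b), as an alternative: pick a non-default pivot epsilon scaling
 * algorithm and combine it with the iterative refinement setting shown above. */
cudssAlgType_t pivot_eps_alg = CUDSS_ALG_1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_EPSILON_ALG,
               &pivot_eps_alg, sizeof(pivot_eps_alg));
```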

Also note that with examples of this nature, catastrophic cancellation might happen, and the result might depend on the reordering permutation and the order of operations during parallel execution (this, of course, only becomes relevant for larger test cases). So I would be cautious about stating the “expected” result in similar cases [with the usual numerical analysis estimates to be respected].

Let us know if you have further questions!

Thanks,
Kirill


Hi Kirill,

Thanks for your detailed explanation; it worked perfectly after setting CUDSS_CONFIG_PIVOT_EPSILON to 1e-300.

One more question: do you have any plans to add functionality for handling zero values on the matrix diagonal, like HSL MC64, on the GPU? With cuDSS, the solve time on the GPU is super fast, but the preconditioning on the CPU is not.

Thanks,
SD2

Hi SD2,

Glad to hear that my advice helped.

Regarding matching (and scaling): yes, we have plans to add support for it; it is a very important feature. However, similarly to, e.g., nested dissection, matching algorithms (e.g., those used in HSL) are inherently non-GPU-friendly, and thus it is likely that matching and scaling will be done on the CPU during the analysis/reordering phase, at least in the first version where we introduce it.

Thus, I would not expect the time spent on such preprocessing to decrease with cuDSS. In the future, we might work on improving the performance of this part (parallelizing it or moving it to the GPU) if we encounter a large enough number of cases where it is a major bottleneck.

Thanks,
Kirill