Why can't I get correct results from the cuDSS solver for a simple matrix?

I created a simple matrix, but could not get the expected results. I pasted the matrix in CSR format into the first cuDSS example. The expected results are: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], but the cuDSS results are:
[-0.1546, 0.000, -0.0759, 24.8304, 26.5462, 8.8571, -0.0, 18.8404, -0.0085, -0.000, 32.0363, -0.000, 32.363, -0.000, 11.8571, -0.0000, -2.1027, 33.5304, -6.44444, 3.8033, -0.0002, 81.3333]
I tested in other sparse solvers and they give the expected results.
My cuDSS version is 0.5. The GPU is a laptop RTX A1000.

I'm not sure whether this is a build issue on my side or a bug in cuDSS.

The CSR matrix is:
int n = 20;
int nnz = 96;
int Ap[] = {
0, 4, 11, 16, 21, 26, 30, 36, 40, 44,
49, 54, 60, 63, 69, 73, 78, 82, 87, 92,
96
};
int Ai[] = {
0, 9, 10, 14, 1, 2, 4, 5, 7, 17,
19, 1, 2, 3, 8, 15, 2, 3, 6, 9,
11, 1, 4, 13, 14, 18, 1, 5, 9, 11,
3, 6, 11, 12, 15, 19, 1, 7, 8, 13,
2, 7, 8, 10, 0, 3, 5, 9, 16, 0,
8, 10, 11, 18, 3, 5, 6, 10, 11, 14,
6, 12, 13, 4, 7, 12, 13, 16, 17, 0,
4, 11, 14, 2, 6, 15, 17, 18, 9, 13,
16, 19, 1, 13, 15, 17, 18, 4, 10, 15,
17, 18, 1, 6, 16, 19
};
double Ax[] = {
4, 5, 7, 5, 7, 4, 5, 2, 8, 2,
5, 4, 4, 2, 6, 7, 2, 5, 4, 6,
5, 5, 7, 3, 8, 8, 2, 7, 1, 3,
4, 4, 6, 6, 7, 3, 8, 6, 5, 1,
6, 5, 5, 3, 5, 6, 1, 4, 4, 7,
3, 3, 2, 6, 5, 3, 6, 2, 1, 5,
6, 7, 5, 3, 1, 5, 8, 4, 3, 5,
8, 5, 1, 7, 7, 4, 3, 3, 4, 4,
6, 3, 2, 3, 3, 3, 6, 8, 6, 3,
6, 1, 5, 3, 3, 1
};
double B[] = {
86, 193, 124, 124, 169, 62, 146, 113, 106, 103,
95, 109, 83, 122, 60, 145, 128, 112, 121, 62
};

Hi!

As a super-quick guess without checking: did you set the matrix type, view, and indexing correctly? If you took one of our examples or samples, the matrix there may (depending on the example) be declared as symmetric.

Typically there is code like
cudssMatrix_t A;
cudssMatrixType_t mtype = CUDSS_MTYPE_SPD;
cudssMatrixViewType_t mview = CUDSS_MVIEW_UPPER;
cudssIndexBase_t base = CUDSS_BASE_ZERO;
in our examples, where you would need to put CUDSS_MTYPE_GENERAL instead (then the view doesn't matter).

If this doesn’t resolve the issue, we’ll have a closer look.

Thanks,
Kirill

Dear Kvoronin,

I wanted to thank you for your suggestion regarding the use of cuDSS for my matrix operations. As it turns out,
the matrix I'm working with is indeed well-conditioned and symmetric, which led me to follow the first example
and plug in my numbers. After changing the mtype to CUDSS_MTYPE_GENERAL (or CUDSS_MTYPE_SYMMETRIC) as you suggested, I obtained correct results.

I am investigating using cuDSS to improve circuit simulator speed, and I've conducted some tests with
matrix sizes ranging from 1000 to 5000. Unfortunately, my findings indicate that cuDSS is not significantly faster
than other direct solvers on the CPU for these sizes on my laptop GPU (RTX A1000). In fact, my results show a 10-20x slowdown compared to the others.

I’m curious to know if this observation aligns with your experience or if I may be missing some solver settings.
Most articles suggest that cuDSS excels over CPU-based solvers such as KLU when the matrix size is sufficiently large.
Could you provide insight into what size matrices might trigger this performance advantage? Additionally, given my
specific matrix range (1000-5000), do you think it’s worth exploring the use of cuDSS on GPU?

I appreciate your expertise and look forward to hearing back from you.

Min

Hi Min!

Great to hear that the functional issue is resolved now.

Regarding the performance: the matrix sizes you mention (1k to 5k) are indeed quite small (and I'm curious as to why you are not solving larger systems for circuit simulation). Typically, such small matrices can't really exploit the amount of parallelism and memory bandwidth of a GPU. It is also very likely that for these small matrices there are extra overheads, just from moving memory from the host to the device, or from necessary synchronizations.

For such small matrices, cuDSS very likely cannot be significantly faster than CPU solvers, but there are things you can do so that cuDSS avoids this large slowdown. Then you can use cuDSS for small matrices as well as for large matrices, where the performance benefits will be much more pronounced.

I suggest doing the following:

  1. Split your time measurements between phases (analysis, factorization, solve) and compare the times for each phase separately. I'm assuming the CPU solver you're comparing against has something equivalent.
  2. In case the factorization & solve times are non-negligible compared to analysis, try enabling the cuDSS hybrid execute mode (see cuDSS Advanced Features — NVIDIA cuDSS for the full description).
    Basically, you need to set CUDSS_CONFIG_HYBRID_EXECUTE_MODE. You may then also pass your data as host pointers, which might bring extra speedup.
  3. In case the analysis phase takes too much time compared to your baseline solver, try the multithreaded mode; see cuDSS Advanced Features — NVIDIA cuDSS. You can try just a handful of host threads (say, no more than 8) and see if performance improves.

The alternative direction, making your application more GPU-friendly, is to batch the problems into one giant system or to increase the number of right-hand sides. Basically, to make good use of a GPU you need to make your problem bigger, if the application allows it.

Thanks,
Kirill

Hi Kirill,

I wanted to thank you for your suggestions and follow up on my investigation into using cuDSS for matrix operations. I've conducted further testing on an NVIDIA T4 GPU, and my results show that cuDSS outperforms traditional CPU solvers as the matrix size increases.

Specifically, once the matrix reaches approximately 2000-3000 in size, cuDSS begins to outperform the traditional CPU solvers, and when I scaled the matrix up to around 20k, the performance advantage of cuDSS became even more pronounced.

I noticed that for cuDSS the time spent in the symbolic analysis phase dominates the overall execution time, whereas for the traditional CPU solvers numerical factorization becomes the main bottleneck as the matrix grows larger. I'm optimistic that upgrading to a more powerful NVIDIA A100 GPU could further improve cuDSS performance. I also measured the time spent copying data between host and device, but it is not significant compared to the analysis, factorization, and solve steps.

Thank you again for your suggestion.
Best regards,
Min