I am using cuDSS (via the nvidia-cudss-cu12wheel in Google Colab) and comparing it to CHOLMOD (scikit-sparse) on small SPD systems. I put a minimal Colab notebook together with a 5×5 SPD example and matching CPU/GPU results.
There are two features I rely on in CHOLMOD that I cannot find in cuDSS:
For an SPD system A, CHOLMOD exposes logdet(A) directly from the sparse Cholesky factor (from the diagonal of L). Request: after a successful SPD factorization in cuDSS, provide a way to query the log-determinant (or at least the diagonal of L) from cudssData_t, so we can compute Gaussian log-likelihoods and related quantities entirely on the GPU.
Applying P^T (L^{-1})^T (or (L^{-1})^T) to vectors/matrices
With CHOLMOD I can re-use the factorization A = P^TLL^TP to apply the inverse square root of A:
Z = rng.standard_normal((N, ndraw)) # columns ~ N(0, I)
U = numeric_factor.solve_Lt(Z, use_LDLt_decomposition=False) # L^T U = Z
V = numeric_factor.apply_Pt(U) # V = P^T U
# V = P^T L^{-T} Z, columns ~ N(0, A^{-1})
Request: I would like to do the analogous thing with cuDSS: after ANALYSIS + FACTORIZATION, call an API that applies the factors (and permutations) to a dense RHS on device:
This would enable GPU implementations of Gaussian sampling (covariance A^{-1}), preconditioners, etc., based directly on the cuDSS factors, instead of only full solves AX = B.
Minimal Colab notebook with CHOLMOD + cuDSS side-by-side.
Thanks for sharing your findings, we have thought about comparison with CHOLMOD but haven’t got to it yet.
Log-determinant from the Cholesky factor
We don’t have this feature yet. The closest we have is providing the factors diagonal via CUDSS_DATA_DIAG ( cuDSS Data Types — NVIDIA cuDSS )
For SPD matrices, this would give you diagonal of L with squared values.
Applying P^T (L^{-1})^T (or (L^{-1})^T) to vectors/matrices
We have solve sub-phases starting with cudss 0.7.0: CUDSS_PHASE_SOLVE_FWD_PERM, CUDSS_PHASE_SOLVE_FWD, …
See cuDSS Data Types — NVIDIA cuDSS and below
I believe this should cover your need (although, if you need to do smth like U^T solve for LU, this would not be possible).
It should work for multiple rhs.
Thank you for pointing me to CUDSS_DATA_DIAG and the split solve phases—this was exactly the right direction. I updated my Colab MWE and wanted to summarize what I’m seeing and confirm a couple of details.
Reading the documentation that you linked to, I initially worried that me trying to compute the log-determinant would force me to use CUDSS_ALG_1/2, while the CUDSS_PHASE_SOLVE_* sub-phases I need for applying the factor might not be compatible with precisely CUDSS_ALG_1/2. (the docs state e.g. for CUDSS_PHASE_SOLVE_FWD: Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering, whereas for CUDSS_DATA_DIAG, the docs state I need CUDSS_ALG_1 or CUDSS_ALG_2.) In practice I can run the solve sub-phases in my setup, but I’d appreciate confirmation on the intended/guaranteed combinations.
With CUDSS_ALG_1 (and similarly ALG_2), CUDSS_DATA_DIAG gives a logdet(A) that matches CHOLMOD exactly if I interpret DIAG as containing L_{ii}^2 (i.e., logdet = sum(log(DIAG))) - exactly as you explained in your answer.
Interestingly, with CUDSS_ALG_DEFAULT I can also read CUDSS_DATA_DIAG, but the raw sum(log(DIAG)) comes out to exactly half the CHOLMOD logdet. Multiplying by 2 matches CHOLMOD, suggesting that under ALG_DEFAULTDIAG contains L_{ii} rather than L_{ii}^2. Could you confirm whether this interpretation is correct and whether it is expected/documented behavior?
For applying the factor to a dense RHS, the following produces what I believe is V = P^T(L^{-1})^TZ (with Z dense n x k, V dense n x k and A the sparse SPD matrix):
cudssExecute(handle,
(cudssPhase_t)(CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM),
config, data, A, V, Z);
is this the correct way to replicate “solve with L^T then apply P^T” (i.e. the CHOLMOD pattern below) or is there a more appropriate cuDSS phase combination for this use case?
End goal is to match the CHOLMOD workflow (sksparse):
U = numeric_factor.solve_Lt(Z, use_LDLt_decomposition=False) # U = L^{-T} Z
V = numeric_factor.apply_Pt(U) # V = P^T U
Functionally, cuDSS produces samples with the desired covariance structure. However, in this toy test (A is 5×5, Z is 5×1,000,000), the cuDSS solve sub-phases are much slower than CHOLMOD; my suspicion is that this is an artifact of the tiny n and that larger sparse systems will be more favorable to cuDSS. I mainly want to confirm that I’m using the intended API path and not accidentally forcing an inefficient mode.
The reason CUDSS_DATA_DIAG is said to be not supported for other reordering options is because it does not take into account the local partial pivoting. Thus, the returned diagonal values are not ordered w.r.t to the CUDSS_DATA_PERM_REORDER_ROW or COL.
Thus, for your purpose of computing the logsum, it is supported.
Thanks for the observation! I’ll check the implementation and get back. The intention was to return the diagonal of the L factor, so the behavior of the case with CUDSS_ALG_DEFAULT with L_{ii} looks more correct to me.
See above.
Yes, your understanding of the solve sub-phases is correct and this combination should work.
Yes, cuDSS currently does not have the goal of having the best performance for tiny matrices and does some unnecessary operations which are amortized for large matrices. Also, for small matrices, it might be that going to the GPU does not make much sense as you cannot saturate it and just moving the data costs too much. For such cases, cuDSS supports hybrid execute mode: cuDSS Advanced Features — NVIDIA cuDSS where factorization and solve can be executed (currently, mostly sequentially) on the host (although, symbolic information currently still stays on the GPU so it’s not a pure CPU solver).
thanks again for your helpful guidance — I did some additional benchmarking with problem sizes closer to my actual application.
I tested banded SPD systems of size n ~28k×28k and ~50k×50k, with dense RHS matrices Z of size n × 200 and n×1000. Even at these relatively modest sizes, cuDSS already outperforms CHOLMOD quite clearly in the phases that matter most for me: numeric factorization and the application of P^TL^{-T} (i.e. backward triangular solve followed by inverse permutation). The speedup grows with bandwidth and number of RHS.
As expected, CHOLMOD is still faster in the analysis phase on the machine I tested. This is not surprising, since reordering in cuDSS is performed on the host; symbolic factorization is done on the device but is typically not the dominant cost. In any case, analysis is not performance-critical for my application since the sparsity pattern is fixed and analysis is performed only once, whereas numeric factorization and solves are repeated many times.
I remain interested in a broader comparison with CHOLMOD, so if you ever do a systematic investigation, maybe drop a short reply in this thread. I would appreciate it. Also, thank you for looking into the implementation/behavior of CUDSS_DATA_DIAG under CUDSS_ALG_DEFAULT — that clarification will be helpful.