We’ve been benchmarking cuDSS and noticed something unexpected regarding floating-point precision and relative backward error (RBE). Specifically, FP32 and FP64 on the GPU yield nearly identical RBE values, especially on smaller problems.
The system under consideration is a structurally discretized ANCF beam model, with input matrices stored in CSR format. The sparse linear system Ax = b is solved using the cuDSS library.
Here are the results (all times in milliseconds):
| Matrix Size | Precision | Analysis | Factorization | Solve | Total Time | RBE |
| --- | --- | --- | --- | --- | --- | --- |
| 4973 × 4973 | float | 19.77 | 5.69 | 2.84 | 28.31 | 5.69E-07 |
| 4973 × 4973 | double | 19.76 | 7.12 | 3.22 | 30.11 | 5.44E-07 |
| 83485 × 83485 | float | 148.88 | 14.82 | 3.69 | 167.39 | 1.64E-09 |
| 83485 × 83485 | double | 149.12 | 32.31 | 4.69 | 186.14 | 3.94E-10 |
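For reference, here is a minimal C sketch of the RBE check, assuming the common normwise definition ||b - Ax||_inf / (||A||_inf ||x||_inf + ||b||_inf) for a CSR matrix (the exact norm used in our harness may differ slightly):

```c
#include <math.h>

/* Normwise relative backward error for a CSR system Ax = b:
 *   RBE = ||b - A*x||_inf / (||A||_inf * ||x||_inf + ||b||_inf)
 * This assumes the common normwise definition; other variants
 * (e.g. componentwise) give different numbers. */
double rbe_csr(int n, const int *row_ptr, const int *col_idx,
               const double *val, const double *x, const double *b)
{
    double res_norm = 0.0, a_norm = 0.0, x_norm = 0.0, b_norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double ax = 0.0, row_abs = 0.0;
        /* Sparse row dot product and row absolute sum in one pass. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            ax += val[k] * x[col_idx[k]];
            row_abs += fabs(val[k]);
        }
        double r = fabs(b[i] - ax);
        if (r > res_norm) res_norm = r;
        if (row_abs > a_norm) a_norm = row_abs;
        if (fabs(x[i]) > x_norm) x_norm = fabs(x[i]);
        if (fabs(b[i]) > b_norm) b_norm = fabs(b[i]);
    }
    return res_norm / (a_norm * x_norm + b_norm);
}
```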
The full problem data is available and reproducible here:
Questions:
What could be causing such similar RBE values between FP32 and FP64 on the GPU?
Are there known cases where solvers internally downcast or mix precision for performance?
Also, what are the best ways to verify the actual precision being used at runtime? Would inspecting the PTX or SASS output help in understanding whether true FP64 operations are taking place?
Additionally:
Should we expect a larger speedup when using FP32 compared to FP64?
Is it normal for the analysis time to sometimes be higher for float than for double?
Any insight or suggestions would be greatly appreciated!
I haven’t checked your systems on my end but there are several things to consider here.
Do you see any perturbed pivots (from cudssDataGet() with CUDSS_DATA_NPIVOTS)? I suspect that the answer is yes and give some tips below for this case.
If so, then the accuracy can be limited by the pivoting epsilon (which you can change). Also, when perturbed pivots are present, it usually helps to include ~2 steps of iterative refinement (this is what some CPU solvers do by default; in cuDSS the default number of iterative refinement steps is 0, and you can change it via cudssConfigSet with CUDSS_CONFIG_IR_NSTEPS).
Also, if perturbed pivots are present, another thing to try is global pivoting, which works if you use CUDSS_ALG_1 or CUDSS_ALG_2 for reordering. These reordering algorithms force a different code path which is typically slower than the default one but can help investigate accuracy if needed.
One other thing to try in the presence of perturbed pivots is turning matching/scaling on.
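To make the checks above concrete, here is a rough C sketch (error handling omitted; CUDSS_DATA_NPIVOTS, CUDSS_CONFIG_IR_NSTEPS, and CUDSS_ALG_1 are as named in this thread, while CUDSS_CONFIG_USE_MATCHING and the exact value types are my assumptions, so verify everything against the cuDSS docs for your version):

```c
#include <stdio.h>
#include <cudss.h>

/* Sketch only: error handling omitted; verify parameter names and value
 * types against the cuDSS documentation for your version. Config changes
 * need to be set before the phases they affect. */
void check_and_tune_pivoting(cudssHandle_t handle, cudssConfig_t config,
                             cudssData_t data)
{
    /* 1) After factorization, query the number of perturbed pivots. */
    int npivots = 0;                     /* value type is an assumption */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_NPIVOTS,
                 &npivots, sizeof(npivots), &written);
    printf("perturbed pivots: %d\n", npivots);

    if (npivots > 0) {
        /* 2) Add ~2 iterative refinement steps (default is 0). */
        int ir_steps = 2;
        cudssConfigSet(config, CUDSS_CONFIG_IR_NSTEPS,
                       &ir_steps, sizeof(ir_steps));

        /* 3) Switch reordering so global pivoting is available. */
        cudssAlgType_t reorder = CUDSS_ALG_1;
        cudssConfigSet(config, CUDSS_CONFIG_REORDERING_ALG,
                       &reorder, sizeof(reorder));

        /* 4) Turn matching/scaling on (parameter name is an assumption). */
        int use_matching = 1;
        cudssConfigSet(config, CUDSS_CONFIG_USE_MATCHING,
                       &use_matching, sizeof(use_matching));
    }
}
```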
However, if you don’t see any perturbed pivots, the case is more interesting.
Lastly, your systems seem to be pretty tiny. For a 4973 × 4973 system, I’d expect better performance with the hybrid execute mode and passing your data on the CPU; essentially, a CPU code path (which at the moment targets small systems up to about 30k × 30k).
Even for 83485 × 83485, there can be substantial overhead just from moving data to/from the GPU.
I’d suggest using bigger matrices to get more representative performance results (if that makes sense for the application). If you have good reasons to stay at these sizes, I’d check GPU utilization for the larger one and think about ways to increase it (e.g., batching, or solving with multiple right-hand sides).
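A sketch of what opting into that mode could look like (CUDSS_CONFIG_HYBRID_EXECUTE_MODE is my recollection of the parameter name, and availability depends on the cuDSS version, so treat this as an assumption and check the docs):

```c
#include <cudss.h>

/* Sketch: opt into the hybrid execute mode so small systems take the CPU
 * code path. CUDSS_CONFIG_HYBRID_EXECUTE_MODE is an assumption here --
 * confirm the name and supported values for your cuDSS version. */
void enable_hybrid_execute(cudssConfig_t config)
{
    int hybrid_execute = 1;
    cudssConfigSet(config, CUDSS_CONFIG_HYBRID_EXECUTE_MODE,
                   &hybrid_execute, sizeof(hybrid_execute));
}
```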
We don’t do any downcasts or precision changes inside cuDSS at the moment. So for well-behaved systems, the accuracy for FP64 should be much better than for FP32 (how close either gets to machine precision depends on the condition number).
In terms of the speedup for FP32 vs FP64, it depends on the GPU (both the compute peak for the respective precision and the memory bandwidth). It’s hard to say anything more specific without details. For tiny systems, latencies can also be a big factor. I’d say profiling with Nsight should help analyze this aspect.
This is not expected. I’d guess it’s instability rather than a trend. Especially for smaller systems, there are many factors to account for, like thread affinity, synchronization/latencies, the presence of a warmup, etc.
Thanks a lot for your detailed reply. I ran several tests based on your suggestions and wanted to follow up with observations from both the 4973×4973 and 83485×83485 systems.
Perturbed Pivots
Yes, perturbed pivots are present, especially in the larger system, which seems to correlate with the relatively high RBE we initially observed.
- 83485 × 83485: float: 5978, double: 2812
- 4973 × 4973: float: 315, double: 315
Iterative Refinement
Tried enabling 1 and 2 steps of iterative refinement for both systems, but didn’t observe any meaningful improvement in RBE.
Reordering Algorithms & Global Pivoting
For the 4973×4973 system, switching to CUDSS_ALG_1 had a positive impact: the RBE improved and matched the CPU solver, and the number of perturbed pivots dropped to zero.
For the 83485×83485 system, using CUDSS_ALG_1 or CUDSS_ALG_2 unfortunately results in a status 2/4 error from the solver. I’m still looking into the root cause; it might be the matrix structure or an unsupported configuration. I made sure to use CUDSS_DEFAULT for CUDSS_CONFIG_PIVOT_EPSILON_ALG. Any tips on how to address or debug this error would be greatly appreciated.
Accuracy after updating solver settings
For the small matrix, enabling global pivoting (via ALG_1) clearly improves accuracy.
For the large matrix, tuning the pivot epsilon without switching reordering worked. I experimented with raising it from the default (1e-13) to 1e-8 while keeping the default reordering algorithm; this reduced the perturbed pivot count significantly and brought the RBE much closer to the CPU solver. Definitely a big improvement. I’m currently digging deeper into understanding the FP32 vs FP64 speed differences.
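Concretely, the change was just the following (a sketch; CUDSS_CONFIG_PIVOT_EPSILON is the parameter name as I understand it from the cuDSS docs, so verify it for your version):

```c
#include <cudss.h>

/* Sketch of the pivot-epsilon change described above (error handling
 * omitted; CUDSS_CONFIG_PIVOT_EPSILON per my reading of the cuDSS docs,
 * verify for your version). Must be set before factorization. */
void relax_pivot_epsilon(cudssConfig_t config)
{
    double pivot_eps = 1e-8;  /* default is 1e-13, per this thread */
    cudssConfigSet(config, CUDSS_CONFIG_PIVOT_EPSILON,
                   &pivot_eps, sizeof(pivot_eps));
}
```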
We’re currently just testing the solver in isolation for validation purposes, so this phase is more about correctness and reproducibility than throughput, and we already have reference CPU solver results. That said, we do plan to move the full pipeline onto the GPU eventually, including assembling and preparing the system natively on-device. Thanks again for all the pointers!