I have implemented LU decomposition with the help of the cuSPARSE library in CUDA C. The code works correctly on a single GPU. Now I want to implement it on a multi-GPU platform. Which library would be helpful for implementing LU decomposition on multiple GPUs?
cuSPARSE already ships an incomplete sparse LU (ILU), so if you really have a complete sparse LU working, that is a substantial piece of work. If so, please answer these questions:
Did you implement partial pivoting? If so, what approach did you use?
Did you use the ‘left-looking’ or ‘right-looking’ method?
What pre-processing steps did you use to decrease the number of non-zeros in the factors? AMD? COLAMD? HSL MC64?
Does your implementation support complex numbers?
Did you validate your implementation against a reference which is known to be correct? A good reference would be SuperLU or MATLAB.
Did you add a thresholding input and the ability to determine when to pivot away from the diagonal?
Do you have an equilibration step for the rows and columns to improve the condition of the matrix?
How did you break the workload down into independent, ordered chunks, and which graph algorithm did you use to do it?
Do you have a symbolic pre-processing step, or did you find a way to pivot dynamically during the main factorization and generate the newly needed non-zero locations on the fly?
Are your results numerically stable when used to solve AX=B?
I have found that there is a great deal of fraud out there when it comes to amateur GPU-based sparse LU linear algebra subroutines. Before you get too excited about your implementation, compare your results against SuperLU for a large sparse matrix with a high condition number.