# Solve AX=B With cuSolver

Hello,
I'm trying to solve a linear system with cuSolver using the LU factorization, and I'm currently using the getrf and getrs functions. But I'm not sure how to recover the solution of the linear system AX=B. Is the parameter B supposed to come back holding X?
If someone has an example of a linear solve, that would be great. I must admit I can find nothing on cuSolver.
My code is:

```cpp
cusolverStatus_t status;
cusolverDnHandle_t handle;
status = cusolverDnCreate(&handle);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "solve fail");
    goto Error;
}
cudaStatus = cudaMalloc(&device_X, (i_bus * npq) * sizeof(double));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}
cudaStatus = cudaMalloc(&device_info, sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}

cudaStatus = cudaMalloc(&device_Ipiv, ((i_bus - 1 + npq) * (i_bus - 1 + npq)) * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}
int work_size = 0;
double* device_work;
status = cusolverDnDgetrf_bufferSize(handle, i_bus - 1 + npq, i_bus - 1 + npq, device_J, i_bus - 1 + npq, &work_size);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "buffersize failed");
    goto Error;
}
cudaStatus = cudaMalloc(&device_work, work_size * sizeof(double));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}
status = cusolverDnDgetrf(handle, i_bus - 1 + npq, i_bus - 1 + npq, device_J, (i_bus - 1 + npq) * sizeof(double), device_work, device_Ipiv, device_info);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "LU factorization failed");
    goto Error;
}
status = cusolverDnDgetrs(handle, CUBLAS_OP_N, i_bus - 1 + npq, 1, device_J, (i_bus - 1 + npq) * sizeof(double), device_Ipiv, device_M, (i_bus - 1 + npq) * sizeof(double), device_info);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "resolution failed");
    goto Error;
}
```

Thanks

CUDA 7.5 RC, now available to the public:

https://developer.nvidia.com/rdp/cuda-75-release-candidate

includes new sample codes, among them several cuSolver samples. For example this one:

cuSolverDn_LinearSolver - cuSolverDn Linear Solver
A CUDA Sample that demonstrates cuSolverDN’s LU, QR and Cholesky factorization.

That sample contains the following function in cuSolverDn_LinearSolver.cpp, so you may want to take a look at that sample code:

```cpp
int linearSolverLU(
    cusolverDnHandle_t handle,
    int n,
    const double *Acopy,
    int lda,
    const double *b,
    double *x)
{
    int bufferSize = 0;
    int *info = NULL;
    double *buffer = NULL;
    double *A = NULL;
    int *ipiv = NULL; // pivoting sequence
    int h_info = 0;
    double start, stop;
    double time_solve;

    checkCudaErrors(cusolverDnDgetrf_bufferSize(handle, n, n, (double*)Acopy, lda, &bufferSize));

    checkCudaErrors(cudaMalloc(&info, sizeof(int)));
    checkCudaErrors(cudaMalloc(&buffer, sizeof(double)*bufferSize));
    checkCudaErrors(cudaMalloc(&A, sizeof(double)*lda*n));
    checkCudaErrors(cudaMalloc(&ipiv, sizeof(int)*n));

    // prepare a copy of A because getrf will overwrite A with L
    checkCudaErrors(cudaMemcpy(A, Acopy, sizeof(double)*lda*n, cudaMemcpyDeviceToDevice));
    checkCudaErrors(cudaMemset(info, 0, sizeof(int)));

    start = second();
    start = second();

    checkCudaErrors(cusolverDnDgetrf(handle, n, n, A, lda, buffer, ipiv, info));
    checkCudaErrors(cudaMemcpy(&h_info, info, sizeof(int), cudaMemcpyDeviceToHost));

    if (0 != h_info) {
        fprintf(stderr, "Error: LU factorization failed\n");
    }

    checkCudaErrors(cudaMemcpy(x, b, sizeof(double)*n, cudaMemcpyDeviceToDevice));
    checkCudaErrors(cusolverDnDgetrs(handle, CUBLAS_OP_N, n, 1, A, lda, ipiv, x, n, info));
    stop = second();

    time_solve = stop - start;
    fprintf(stdout, "timing: LU = %10.6f sec\n", time_solve);

    if (info  ) { checkCudaErrors(cudaFree(info  )); }
    if (buffer) { checkCudaErrors(cudaFree(buffer)); }
    if (A     ) { checkCudaErrors(cudaFree(A)); }
    if (ipiv  ) { checkCudaErrors(cudaFree(ipiv)); }

    return 0;
}
```

Thanks, that helped me a lot.
But only half of X is correct, and I get some cuda-memcheck errors from the cusolverDnDgetrf function.
I am sure that the parameters are fine.
Does anyone have an idea what could cause a cuda-memcheck error?

Do you mean cuda-memcheck errors with the sample code?

I don’t see any cuda-memcheck errors when I run the sample code:

```
$ cuda-memcheck /usr/local/cuda/samples/bin/x86_64/linux/release/cuSolverDn_LinearSolver -R=lu -file=/usr/local/cuda/samples/7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx
========= CUDA-MEMCHECK
GPU Device 1: "GeForce GT 640" with compute capability 3.5

step 1: read matrix market format
Using input file [/usr/local/cuda/samples/7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b
timing: LU =   0.537043 sec
step 6: evaluate residual
|b - A*x| = 9.947598E-14
|A| = 1.600000E+01
|x| = 2.357708E+01
|b - A*x|/(|A|*|x|) = 2.636988E-16
========= ERROR SUMMARY: 0 errors
$
```

No, not with that code.
I have a matrix named J of size (i_bus-1+npq, i_bus-1+npq) and a vector M of size (i_bus-1+npq), and I'm trying to solve the system J*x = M.
This is my code:

```cpp
cusolverStatus_t status;
cusolverDnHandle_t handle;
status = cusolverDnCreate(&handle);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "solve fail");
    goto Error;
}
cudaStatus = cudaMalloc(&device_info, sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}

cudaStatus = cudaMalloc(&device_Ipiv, (i_bus - 1 + npq) * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}
int work_size = 0;
double* device_work;
status = cusolverDnDgetrf_bufferSize(handle, i_bus - 1 + npq, i_bus - 1 + npq, device_J, i_bus - 1 + npq, &work_size);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "buffersize failed");
    goto Error;
}

cudaStatus = cudaMalloc(&device_work, work_size * sizeof(double));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed");
    goto Error;
}
if (info != 0) {
    fprintf(stderr, "Error: LU factorization failed\n");
}
status = cusolverDnDgetrs(handle, CUBLAS_OP_N, i_bus - 1 + npq, 1, device_J, (i_bus - 1 + npq), device_Ipiv, device_M, (i_bus - 1 + npq), device_info);
if (status != CUSOLVER_STATUS_SUCCESS) {
    fprintf(stderr, "resolution failed");
    goto Error;
}
```

With this code, the cusolverDnDgetrf function gives me this kind of error:

```
GPU State:
----------------------------------------------------------------------------------------------------------------
1e00000000     4    adr st    g           0       1          {0,0,0}    {1,0,0}  _Z10dtrsv_initPi+000078
1e00000000     4    adr st    g           0       2          {0,0,0}    {2,0,0}  _Z10dtrsv_initPi+000078
1e00000000     4    adr st    g           0       3          {0,0,0}    {3,0,0}  _Z10dtrsv_initPi+000078

Summary of access violations:
================================================================================

Memory Checker detected 3 access violations.
error = access violation on store (global memory)
gridid = 38
blockIdx = {0,0,0}
accessSize = 4
```
And I don't understand why, because it seems fine to me.

If you want to provide a complete code that I could copy, paste, compile and run, without having to add anything or change anything, I’ll take a look. Otherwise perhaps someone else can spot an error in what you have shown.

Hey,
Here is a code that compiles and runs for me. It needs cuSolver, obviously. Don't pay attention to the initialization of the matrix; it's just meant to give the best example of what I do.
I don't know whether the data in the matrix matters, so I include it too.
Thanks for helping.

```cpp
#include "cusolverDn.h"
#include <iostream>
#include "device_launch_parameters.h"
#include "cuda_runtime.h"

int main()
{
    int i_bus = 16;
    int j_bus = 10;
    int npq = 14;
    int work_size = 0;
    int info = 0;
    double** Temp_j = new double*[i_bus - 1 + npq];
    double Host_M[29] = {8e+006,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; // i_bus-1+npq
    double temp_j0[29] = {2e+008,0,0,0,0,0,0,0,-2e+008,0,0,0,0,0,0,0,0,0,0,0,0,0,-10000,0,0,0,0,0,0};
    double temp_j1[29] = {0,4e+008,-2e+008,-2e+008,0,0,0,0,0,0,0,0,0,0,0,20000,-10000,-10000,0,0,0,0,0,0,0,0,0,0,0};
    double temp_j2[29] = {0,-2e+008,4e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,20000,0,0,0,0,0,0,0,0,0,0,0,0};
    double temp_j3[29] = {0,-2e+008,0,6e+008,-2e+008,0,0,-2e+008,0,0,0,0,0,0,0,-10000,0,30000,-10000,0,0,-10000,0,0,0,0,0,0,0};
    double temp_j4[29] = {0,0,0,-2e+008,4e+008,0,0,0,0,0,0,-2e+008,0,0,0,0,0,-10000,20000,0,0,0,0,0,0,-10000,0,0,0};
    double temp_j5[29] = {0,0,0,0,0,6e+008,-2e+008,-2e+008,0,0,-2e+008,0,0,0,0,0,0,0,0,30000,-10000,-10000,0,0,-10000,0,0,0,0};
    double temp_j6[29] = {0,0,0,0,0,-2e+008,4e+008,0,0,-2e+008,0,0,0,0,0,0,0,0,0,-10000,20000,0,0,-10000,0,0,0,0,0};
    double temp_j7[29] = {0,0,0,-2e+008,0,-2e+008,0,6e+008,0,0,0,0,-2e+008,0,0,0,0,-10000,0,-10000,0,30000,0,0,0,0,-10000,0,0};
    double temp_j8[29] = {-2e+008,0,0,0,0,0,0,0,4e+008,0,0,0,-2e+008,0,0,0,0,0,0,0,0,0,20000,0,0,0,-10000,0,0};
    double temp_j9[29] = {0,0,0,0,0,0,-2e+008,0,0,4e+008,0,0,0,-2e+008,0,0,0,0,0,0,-10000,0,0,20000,0,0,0,-10000,0};
    double temp_j10[29] = {0,0,0,0,0,-2e+008,0,0,0,0,2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,0,10000,0,0,0,0};
    double temp_j11[29] = {0,0,0,0,-2e+008,0,0,0,0,0,0,2e+008,0,0,0,0,0,0,-10000,0,0,0,0,0,0,10000,0,0,0};
    double temp_j12[29] = {0,0,0,0,0,0,0,-2e+008,-2e+008,0,0,0,4e+008,0,0,0,0,0,0,0,0,-10000,-10000,0,0,0,20000,0,0};
    double temp_j13[29] = {0,0,0,0,0,0,0,0,0,-2e+008,0,0,0,4e+008,-2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,20000,-10000};
    double temp_j14[29] = {0,0,0,0,0,0,0,0,0,0,0,0,0,-2e+008,2e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,10000};
    double temp_j15[29] = {-0,-4e+008,2e+008,2e+008,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,20000,-10000,-10000,0,0,0,0,0,0,0,0,0,0,0};
    double temp_j16[29] = {-0,2e+008,-4e+008,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-10000,20000,0,0,0,0,0,0,0,0,0,0,0,0};
    double temp_j17[29] = {-0,2e+008,-0,-6e+008,2e+008,-0,-0,2e+008,-0,-0,-0,-0,-0,-0,-0,-10000,0,30000,-10000,0,0,-10000,0,0,0,0,0,0,0};
    double temp_j18[29] = {-0,-0,-0,2e+008,-4e+008,-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-0,0,0,-10000,20000,0,0,0,0,0,0,-10000,0,0,0};
    double temp_j19[29] = {-0,-0,-0,-0,-0,-6e+008,2e+008,2e+008,-0,-0,2e+008,-0,-0,-0,-0,0,0,0,0,30000,-10000,-10000,0,0,-10000,0,0,0,0};
    double temp_j20[29] = {-0,-0,-0,-0,-0,2e+008,-4e+008,-0,-0,2e+008,-0,-0,-0,-0,-0,0,0,0,0,-10000,20000,0,0,-10000,0,0,0,0,0};
    double temp_j21[29] = {-0,-0,-0,2e+008,-0,2e+008,-0,-6e+008,-0,-0,-0,-0,2e+008,-0,-0,0,0,-10000,0,-10000,0,30000,0,0,0,0,-10000,0,0};
    double temp_j22[29] = {2e+008,-0,-0,-0,-0,-0,-0,-0,-4e+008,-0,-0,-0,2e+008,-0,-0,0,0,0,0,0,0,0,20000,0,0,0,-10000,0,0};
    double temp_j23[29] = {-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-4e+008,-0,-0,-0,2e+008,-0,0,0,0,0,0,-10000,0,0,20000,0,0,0,-10000,0};
    double temp_j24[29] = {-0,-0,-0,-0,-0,2e+008,-0,-0,-0,-0,-2e+008,-0,-0,-0,-0,0,0,0,0,-10000,0,0,0,0,10000,0,0,0,0};
    double temp_j25[29] = {-0,-0,-0,-0,2e+008,-0,-0,-0,-0,-0,-0,-2e+008,-0,-0,-0,0,0,0,-10000,0,0,0,0,0,0,10000,0,0,0};
    double temp_j26[29] = {-0,-0,-0,-0,-0,-0,-0,2e+008,2e+008,-0,-0,-0,-4e+008,-0,-0,0,0,0,0,0,0,-10000,-10000,0,0,0,20000,0,0};
    double temp_j27[29] = {-0,-0,-0,-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-0,-4e+008,2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,20000,-10000};
    double temp_j28[29] = {-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,2e+008,-2e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,10000};
    Temp_j[0] = temp_j0;
    Temp_j[1] = temp_j1;
    Temp_j[2] = temp_j2;
    Temp_j[3] = temp_j3;
    Temp_j[4] = temp_j4;
    Temp_j[5] = temp_j5;
    Temp_j[6] = temp_j6;
    Temp_j[7] = temp_j7;
    Temp_j[8] = temp_j8;
    Temp_j[9] = temp_j9;
    Temp_j[10] = temp_j10;
    Temp_j[11] = temp_j11;
    Temp_j[12] = temp_j12;
    Temp_j[13] = temp_j13;
    Temp_j[14] = temp_j14;
    Temp_j[15] = temp_j15;
    Temp_j[16] = temp_j16;
    Temp_j[17] = temp_j17;
    Temp_j[18] = temp_j18;
    Temp_j[19] = temp_j19;
    Temp_j[20] = temp_j20;
    Temp_j[21] = temp_j21;
    Temp_j[22] = temp_j22;
    Temp_j[23] = temp_j23;
    Temp_j[24] = temp_j24;
    Temp_j[25] = temp_j25;
    Temp_j[26] = temp_j26;
    Temp_j[27] = temp_j27;
    Temp_j[28] = temp_j28;
    double* Host_J = new double[(i_bus - 1 + npq) * (i_bus - 1 + npq)];
    double* device_M;
    double* device_J;
    int* device_info;
    int* device_Ipiv;
    int flat = 0;
    for (int i = 0; i < (i_bus - 1 + npq); i++) {
        for (int j = 0; j < (i_bus - 1 + npq); j++) {
            Host_J[flat] = Temp_j[i][j];
            flat++;
        }
    }
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    cudaStatus = cudaMalloc(&device_J, (i_bus - 1 + npq) * (i_bus - 1 + npq) * sizeof(double));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed");
        goto Error;
    }
    cudaStatus = cudaMalloc(&device_M, (i_bus - 1 + npq) * sizeof(double));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed");
    }
    cudaStatus = cudaMalloc(&device_info, sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed");
        goto Error;
    }

    cudaStatus = cudaMalloc(&device_Ipiv, (i_bus - 1 + npq) * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed");
        goto Error;
    }
    cudaStatus = cudaMemcpy(device_J, Host_J, (i_bus - 1 + npq) * (i_bus - 1 + npq) * sizeof(double), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed");
    }
    cudaStatus = cudaMemcpy(device_M, Host_M, (i_bus - 1 + npq) * sizeof(double), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed");
    }
    cusolverStatus_t status;
    cusolverDnHandle_t handle;
    status = cusolverDnCreate(&handle);
    if (status != CUSOLVER_STATUS_SUCCESS) {
        fprintf(stderr, "solve fail");
        goto Error;
    }
    double* device_work;
    status = cusolverDnDgetrf_bufferSize(handle, i_bus - 1 + npq, i_bus - 1 + npq, device_J, i_bus - 1 + npq, &work_size);
    if (status != CUSOLVER_STATUS_SUCCESS) {
        fprintf(stderr, "buffersize failed");
        goto Error;
    }

    cudaStatus = cudaMalloc(&device_work, work_size * sizeof(double));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed");
        goto Error;
    }
    status = cusolverDnDgetrf(handle, i_bus - 1 + npq, i_bus - 1 + npq, device_J, (i_bus - 1 + npq), device_work, device_Ipiv, device_info);
    if (status != CUSOLVER_STATUS_SUCCESS) {
        fprintf(stderr, "LU factorization failed");
        goto Error;
    }
    cudaStatus = cudaMemcpy(&info, device_info, sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "memcpy fail");
        goto Error;
    }
    if (info != 0) {
        fprintf(stderr, "Error: LU factorization failed\n");
    }
    status = cusolverDnDgetrs(handle, CUBLAS_OP_N, i_bus - 1 + npq, 1, device_J, (i_bus - 1 + npq), device_Ipiv, device_M, (i_bus - 1 + npq), device_info);
    if (status != CUSOLVER_STATUS_SUCCESS) {
        fprintf(stderr, "resolution failed");
        goto Error;
    }
    cudaStatus = cudaMemcpy(&info, device_info, sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "memcpy fail");
        goto Error;
    }
    if (info != 0) {
        fprintf(stderr, "Error: solve failed\n");
    }
    int test;
    std::cin >> test;
    return 0;
Error:
    cudaFree(device_M);
    cudaFree(device_J);
    cudaFree(device_Ipiv);
    cudaFree(device_info);
}
```

I don’t see any errors compiling or running your code.

When I run it with cuda-memcheck:

```
$ nvcc -o t885 t885.cu -lcusolver
t885.cu(10): warning: variable "j_bus" was declared but never referenced

t885.cu(10): warning: variable "j_bus" was declared but never referenced

$ cuda-memcheck ./t885
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
```

(I removed the std::cin at the end.)

I get no errors. I am using Linux; I suspect you are using Windows.

What exactly is the problem? If the data is not what you expect, please have the code output the data you think is wrong, and then identify what you think the correct data should be.

If the only concern is errors reported by the CUDA memory checker built into Visual Studio, you may need to spell out the exact platform you are using.

Are you now using CUDA 7.5 RC? If not, please update to that version. Also, if the only problem is the CUDA memory checker built into Visual Studio, please re-run the command-line version of the app, outside of Visual Studio, using cuda-memcheck, and report those results.

I updated CUDA to 7.5, and Nsight doesn't report any errors now.
But the data I expected for X is slightly different.
Here is the data I get using CUDA:

```
0.28
0.08
0.04
0.12
0.12
0.16
0.16
0.16
0.24
0.16
0.16
0.12
0.2
0.16
0.16
-1.59748e-017
-8.70625e-018
-1.94097e-017
-3.62187e-017
-1.88343e-017
-3.0594e-017
-1.83001e-017
-7.47395e-018
-4.03762e-017
3.89248e-018
-4.45395e-017
-9.1973e-018
-4.46267e-017
-4.41709e-017
```

And here is what I get using an LU solve with Eigen:

```
0.28
0.08
0.04
0.12
0.12
0.16
0.16
0.16
0.24
0.16
0.16
0.12
0.2
0.16
0.16
-1.03244e-012
-5.16219e-013
-1.54866e-012
-1.92119e-012
-1.90841e-012
-2.12448e-012
-1.69235e-012
-5.64115e-013
-2.12448e-012
-1.90841e-012
-1.98824e-012
-1.12823e-012
-2.12448e-012
-2.12448e-012
```

I don't know why only half of the vector is correct.

There may be two things going on.

1. cuSolver expects data in column-major form. It's not clear to me whether you've taken this into account, but when I transpose your A matrix, I get results that are closer to your Eigen results in magnitude, though still not identical.
2. The disparity may also have to do with matrix conditioning. If I subtract two double quantities in the range of your first 15 outputs that are approximately equal, I get a number around 1e-12, which is where your Eigen results sit (and this seems possible with your given A matrix, although I haven't worked through it). At that scale, I wouldn't be surprised by differences between a serial code and a parallel code.

I transposed A to column-major form, and I suppose the remaining difference isn't significant. So cuSolver works well; thank you for your help.