Solve AX=B With cuSolver

miyoku · August 13, 2015, 2:46pm

Hello
I’m trying to solve a linear system with cuSolver by using the LU factorization. And i’m currently using the getrf and getrs function. But i’m not sure how to recover the solution of the linear system AX=B. Is the parameter B supposed to be X?
If someone have an example for a linear resolution, it would be cool. I must admit, i find nothing with CuSolver.
My code is :

cusolverStatus_t  status;
	cusolverDnHandle_t handle;
	status=cusolverDnCreate(&handle);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"solve fail");
		goto Error;
	}
	cudaStatus= cudaMalloc(&device_X,(i_bus*npq)*sizeof(double));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
	cudaStatus= cudaMalloc(&device_info,sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}

	cudaStatus= cudaMalloc(&device_Ipiv,((i_bus-1+npq)*(i_bus-1+npq))*sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
                int work_size=0;
		double* device_work;
		status= cusolverDnDgetrf_bufferSize(handle,i_bus-1+npq,i_bus-1+npq,device_J,i_bus-1+npq,&work_size);
		if(status!=CUSOLVER_STATUS_SUCCESS){
			fprintf(stderr,"buffersize failed");
			goto Error;
		}
		cudaStatus= cudaMalloc(&device_work,work_size*sizeof(double));
		if(cudaStatus!=cudaSuccess){
			fprintf(stderr,"cudaMalloc failed");
			goto Error;
		}
		status=cusolverDnDgetrf(handle,i_bus-1+npq,i_bus-1+npq,device_J,(i_bus-1+npq)*sizeof(double),device_work,device_Ipiv,device_info);
		if(status!=CUSOLVER_STATUS_SUCCESS){
			fprintf(stderr,"Factorisation LU failed");
			goto Error;
		}
		status=cusolverDnDgetrs(handle,CUBLAS_OP_N,i_bus-1+npq,1,device_J,(i_bus-1+npq)*sizeof(double),device_Ipiv,device_M,(i_bus-1+npq)*sizeof(double),device_info);
		if(status!=CUSOLVER_STATUS_SUCCESS){
			fprintf(stderr,"résolution failed");
			goto Error;
		}

Thanks

Robert_Crovella · August 13, 2015, 3:30pm

CUDA 7.5RC, available now to the public:

https://developer.nvidia.com/rdp/cuda-75-release-candidate

has new sample codes that include some cusolver sample codes. For example this one:

cuSolverDn_LinearSolver - cuSolverDn Linear Solver
A CUDA Sample that demonstrates cuSolverDN’s LU, QR and Cholesky factorization.

That sample contains the following function in cuSolverDn_LinearSolver.cpp, so you may want to take a look at that sample code:

int linearSolverLU(
    cusolverDnHandle_t handle,
    int n,
    const double *Acopy,
    int lda,
    const double *b,
    double *x)
{
    int bufferSize = 0;
    int *info = NULL;
    double *buffer = NULL;
    double *A = NULL;
    int *ipiv = NULL; // pivoting sequence
    int h_info = 0;
    double start, stop;
    double time_solve;

    checkCudaErrors(cusolverDnDgetrf_bufferSize(handle, n, n, (double*)Acopy, lda, &bufferSize));

    checkCudaErrors(cudaMalloc(&info, sizeof(int)));
    checkCudaErrors(cudaMalloc(&buffer, sizeof(double)*bufferSize));
    checkCudaErrors(cudaMalloc(&A, sizeof(double)*lda*n));
    checkCudaErrors(cudaMalloc(&ipiv, sizeof(int)*n));

// prepare a copy of A because getrf will overwrite A with L
    checkCudaErrors(cudaMemcpy(A, Acopy, sizeof(double)*lda*n, cudaMemcpyDeviceToDevice));
    checkCudaErrors(cudaMemset(info, 0, sizeof(int)));

    start = second();
    start = second();

    checkCudaErrors(cusolverDnDgetrf(handle, n, n, A, lda, buffer, ipiv, info));
    checkCudaErrors(cudaMemcpy(&h_info, info, sizeof(int), cudaMemcpyDeviceToHost));

    if ( 0 != h_info ){
        fprintf(stderr, "Error: LU factorization failed\n");
    }

    checkCudaErrors(cudaMemcpy(x, b, sizeof(double)*n, cudaMemcpyDeviceToDevice));
    checkCudaErrors(cusolverDnDgetrs(handle, CUBLAS_OP_N, n, 1, A, lda, ipiv, x, n, info));
    checkCudaErrors(cudaDeviceSynchronize());
    stop = second();

    time_solve = stop - start;
    fprintf (stdout, "timing: LU = %10.6f sec\n", time_solve);

    if (info  ) { checkCudaErrors(cudaFree(info  )); }
    if (buffer) { checkCudaErrors(cudaFree(buffer)); }
    if (A     ) { checkCudaErrors(cudaFree(A)); }
    if (ipiv  ) { checkCudaErrors(cudaFree(ipiv));}

    return 0;
}

miyoku · August 15, 2015, 8:31am

Thanks, it helped me a lot
But Only the half of X is complete. And i find some Cudamemcheck error with the cusolverDnDgetrf function.
I am sure that the parameter is fine.
Does anyone have an idea about what reason can make Cudamemcheck error?

Robert_Crovella · August 15, 2015, 12:49pm

Do you mean cuda-memcheck errors with the sample code?

I don’t see any cuda-memcheck errors when I run the sample code:

$ cuda-memcheck /usr/local/cuda/samples/bin/x86_64/linux/release/cuSolverDn_LinearSolver -R=lu -file=/usr/local/cuda/samples/7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx
========= CUDA-MEMCHECK
GPU Device 1: "GeForce GT 640" with compute capability 3.5

step 1: read matrix market format
Using input file [/usr/local/cuda/samples/7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b
timing: LU =   0.537043 sec
step 6: evaluate residual
|b - A*x| = 9.947598E-14
|A| = 1.600000E+01
|x| = 2.357708E+01
|b - A*x|/(|A|*|x|) = 2.636988E-16
========= ERROR SUMMARY: 0 errors
$

miyoku · August 15, 2015, 3:19pm

No, not with this code.
I have a matrice named J which is size of (i_bus-1+npq,i_bus-1+npq) and a vector M (i_bus-1+npq)
And i’m trying to solve the system J*x=M
And this is my code:

cusolverStatus_t  status;
	cusolverDnHandle_t handle;
	status=cusolverDnCreate(&handle);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"solve fail");
		goto Error;
	}
        cudaStatus= cudaMalloc(&device_info,sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}

	cudaStatus= cudaMalloc(&device_Ipiv,(i_bus-1+npq)*sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
                int work_size=0;
		double* device_work;
		status= cusolverDnDgetrf_bufferSize(handle,i_bus-1+npq,i_bus-1+npq,device_J,i_bus-1+npq,&work_size);
		if(status!=CUSOLVER_STATUS_SUCCESS){
			fprintf(stderr,"buffersize failed");
			goto Error;
		}
		
		cudaStatus= cudaMalloc(&device_work,work_size*sizeof(double));
		if(cudaStatus!=cudaSuccess){
			fprintf(stderr,"cudaMalloc failed");
			goto Error;
		}
                if(info!=0){
			fprintf(stderr,"Error:LU factorization failed\n");
		}
                status=cusolverDnDgetrs(handle,CUBLAS_OP_N ,i_bus-1+npq,1,device_J,(i_bus-1+npq),device_Ipiv,device_M,(i_bus-1+npq),device_info);
		if(status!=CUSOLVER_STATUS_SUCCESS){
			fprintf(stderr,"résolution failed");
			goto Error;
		}

And whith this code, the function cusolverDnDgetrf give me this kind of error:

GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                       PC  Source
----------------------------------------------------------------------------------------------------------------
1e00000000     4    adr st    g           0       1          {0,0,0}    {1,0,0}  _Z10dtrsv_initPi+000078        
1e00000000     4    adr st    g           0       2          {0,0,0}    {2,0,0}  _Z10dtrsv_initPi+000078        
1e00000000     4    adr st    g           0       3          {0,0,0}    {3,0,0}  _Z10dtrsv_initPi+000078        


Summary of access violations:
================================================================================

Memory Checker detected 3 access violations.
error = access violation on store (global memory)
gridid = 38
blockIdx = {0,0,0}
threadIdx = {1,0,0}
address = 0x1e00000000
accessSize = 4

And i don’t understand why because it’s seems fine to me.

Robert_Crovella · August 15, 2015, 5:27pm

If you want to provide a complete code that I could copy, paste, compile and run, without having to add anything or change anything, I’ll take a look. Otherwise perhaps someone else can spot an error in what you have shown.

miyoku · August 15, 2015, 8:26pm

Hey
This is a code which compile and run with me. It need Cusolver obviously. Don’t pay attention to the initialisation of the matrix. It’s in the aim to give the best example of what i do.
I don’t know if the data in the matrix matter or not so i give it to.
Thanks for helping

#include "cusolverDn.h"
#include <iostream>
#include "device_launch_parameters.h"
#include "cuda_runtime.h"
int main()
{

	int i_bus=16;
	int j_bus=10;
	int npq=14;
	int work_size=0;
	int info=0;
	double** Temp_j= new double*[i_bus-1+npq];
	double Host_M[29]={8e+006,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; //i_bus-1+npq
	double temp_j0[29]={2e+008,0,0,0,0,0,0,0,-2e+008,0,0,0,0,0,0,0,0,0,0,0,0,0,-10000,0,0,0,0,0,0};
	double temp_j1[29]={0,4e+008,-2e+008,-2e+008,0,0,0,0,0,0,0,0,0,0,0,20000,-10000,-10000,0,0,0,0,0,0,0,0,0,0,0};
	double temp_j2[29]={0,-2e+008,4e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,20000,0,0,0,0,0,0,0,0,0,0,0,0};
	double temp_j3[29]={0,-2e+008,0,6e+008,-2e+008,0,0,-2e+008,0,0,0,0,0,0,0,-10000,0,30000,-10000,0,0,-10000,0,0,0,0,0,0,0};
	double temp_j4[29]={0,0,0,-2e+008,4e+008,0,0,0,0,0,0,-2e+008,0,0,0,0,0,-10000,20000,0,0,0,0,0,0,-10000,0,0,0};
	double temp_j5[29]={0,0,0,0,0,6e+008,-2e+008,-2e+008,0,0,-2e+008,0,0,0,0,0,0,0,0,30000,-10000,-10000,0,0,-10000,0,0,0,0};
	double temp_j6[29]={0,0,0,0,0,-2e+008,4e+008,0,0,-2e+008,0,0,0,0,0,0,0,0,0,-10000,20000,0,0,-10000,0,0,0,0,0};
	double temp_j7[29]={0,0,0,-2e+008,0,-2e+008,0,6e+008,0,0,0,0,-2e+008,0,0,0,0,-10000,0,-10000,0,30000,0,0,0,0,-10000,0,0};
	double temp_j8[29]={-2e+008,0,0,0,0,0,0,0,4e+008,0,0,0,-2e+008,0,0,0,0,0,0,0,0,0,20000,0,0,0,-10000,0,0};
	double temp_j9[29]={0,0,0,0,0,0,-2e+008,0,0,4e+008,0,0,0,-2e+008,0,0,0,0,0,0,-10000,0,0,20000,0,0,0,-10000,0};
	double temp_j10[29]={0,0,0,0,0,-2e+008,0,0,0,0,2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,0,10000,0,0,0,0};
	double temp_j11[29]={0,0,0,0,-2e+008,0,0,0,0,0,0,2e+008,0,0,0,0,0,0,-10000,0,0,0,0,0,0,10000,0,0,0};
	double temp_j12[29]={0,0,0,0,0,0,0,-2e+008,-2e+008,0,0,0,4e+008,0,0,0,0,0,0,0,0,-10000,-10000,0,0,0,20000,0,0};
	double temp_j13[29]={0,0,0,0,0,0,0,0,0,-2e+008,0,0,0,4e+008,-2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,20000,-10000};
	double temp_j14[29]={0,0,0,0,0,0,0,0,0,0,0,0,0,-2e+008,2e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,10000};
	double temp_j15[29]={-0,-4e+008,2e+008,2e+008,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,20000,-10000,-10000,0,0,0,0,0,0,0,0,0,0,0};
	double temp_j16[29]={-0,2e+008,-4e+008,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-10000,20000,0,0,0,0,0,0,0,0,0,0,0,0};
	double temp_j17[29]={-0,2e+008,-0,-6e+008,2e+008,-0,-0,2e+008,-0,-0,-0,-0,-0,-0,-0,-10000,0,30000,-10000,0,0,-10000,0,0,0,0,0,0,0};
	double temp_j18[29]={-0,-0,-0,2e+008,-4e+008,-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-0,0,0,-10000,20000,0,0,0,0,0,0,-10000,0,0,0};
	double temp_j19[29]={-0,-0,-0,-0,-0,-6e+008,2e+008,2e+008,-0,-0,2e+008,-0,-0,-0,-0,0,0,0,0,30000,-10000,-10000,0,0,-10000,0,0,0,0};
	double temp_j20[29]={-0,-0,-0,-0,-0,2e+008,-4e+008,-0,-0,2e+008,-0,-0,-0,-0,-0,0,0,0,0,-10000,20000,0,0,-10000,0,0,0,0,0};
	double temp_j21[29]={-0,-0,-0,2e+008,-0,2e+008,-0,-6e+008,-0,-0,-0,-0,2e+008,-0,-0,0,0,-10000,0,-10000,0,30000,0,0,0,0,-10000,0,0};
	double temp_j22[29]={2e+008,-0,-0,-0,-0,-0,-0,-0,-4e+008,-0,-0,-0,2e+008,-0,-0,0,0,0,0,0,0,0,20000,0,0,0,-10000,0,0};
	double temp_j23[29]={-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-4e+008,-0,-0,-0,2e+008,-0,0,0,0,0,0,-10000,0,0,20000,0,0,0,-10000,0};
	double temp_j24[29]={-0,-0,-0,-0,-0,2e+008,-0,-0,-0,-0,-2e+008,-0,-0,-0,-0,0,0,0,0,-10000,0,0,0,0,10000,0,0,0,0};
	double temp_j25[29]={-0,-0,-0,-0,2e+008,-0,-0,-0,-0,-0,-0,-2e+008,-0,-0,-0,0,0,0,-10000,0,0,0,0,0,0,10000,0,0,0};
	double temp_j26[29]={-0,-0,-0,-0,-0,-0,-0,2e+008,2e+008,-0,-0,-0,-4e+008,-0,-0,0,0,0,0,0,0,-10000,-10000,0,0,0,20000,0,0};
	double temp_j27[29]={-0,-0,-0,-0,-0,-0,-0,-0,-0,2e+008,-0,-0,-0,-4e+008,2e+008,0,0,0,0,0,0,0,0,-10000,0,0,0,20000,-10000};
	double temp_j28[29]={-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,-0,2e+008,-2e+008,0,0,0,0,0,0,0,0,0,0,0,0,-10000,10000};
	Temp_j[0]=temp_j0;
	Temp_j[1]=temp_j1;
	Temp_j[2]=temp_j2;
	Temp_j[3]=temp_j3;
	Temp_j[4]=temp_j4;
	Temp_j[5]=temp_j5;
	Temp_j[6]=temp_j6;
	Temp_j[7]=temp_j7;
	Temp_j[8]=temp_j8;
	Temp_j[9]=temp_j9;
	Temp_j[10]=temp_j10;
	Temp_j[11]=temp_j11;
	Temp_j[12]=temp_j12;
	Temp_j[13]=temp_j13;
	Temp_j[14]=temp_j14;
	Temp_j[15]=temp_j15;
	Temp_j[16]=temp_j16;
	Temp_j[17]=temp_j17;
	Temp_j[18]=temp_j18;
	Temp_j[19]=temp_j19;
	Temp_j[20]=temp_j20;
	Temp_j[21]=temp_j21;
	Temp_j[22]=temp_j22;
	Temp_j[23]=temp_j23;
	Temp_j[24]=temp_j24;
	Temp_j[25]=temp_j25;
	Temp_j[26]=temp_j26;
	Temp_j[27]=temp_j27;
	Temp_j[28]=temp_j28;
	double* Host_J= new double [(i_bus-1+npq)*(i_bus-1+npq)];
	double* device_M;
	double* device_J;
	int* device_info;
	int* device_Ipiv;
	int flat=0;
	for(int i=0; i<(i_bus-1+npq);i++){
		for(int j=0;j<(i_bus-1+npq);j++){
			Host_J[flat]=Temp_j[i][j];
			flat++;
		}
	}
	cudaError_t cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
	cudaStatus=cudaMalloc(&device_J, (i_bus-1+npq)*(i_bus-1+npq)*sizeof(double));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
	cudaStatus=cudaMalloc(&device_M,(i_bus-1+npq)*sizeof(double));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudamalloc failed");
	}
	cudaStatus= cudaMalloc(&device_info,sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}

	cudaStatus= cudaMalloc(&device_Ipiv,(i_bus-1+npq)*sizeof(int));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
	cudaStatus= cudaMemcpy(device_J,Host_J,(i_bus-1+npq)*(i_bus-1+npq)*sizeof(double),cudaMemcpyHostToDevice);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMemcpy failed");
	}
	cudaStatus= cudaMemcpy(device_M,Host_M,(i_bus-1+npq)*sizeof(double),cudaMemcpyHostToDevice);
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMemcpy failed");
	}
	cusolverStatus_t  status;
	cusolverDnHandle_t handle;
	status=cusolverDnCreate(&handle);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"solve fail");
		goto Error;
	}
	double* device_work;
	status= cusolverDnDgetrf_bufferSize(handle,i_bus-1+npq,i_bus-1+npq,device_J,i_bus-1+npq,&work_size);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"buffersize failed");
		goto Error;
	}
		
	cudaStatus= cudaMalloc(&device_work,work_size*sizeof(double));
	if(cudaStatus!=cudaSuccess){
		fprintf(stderr,"cudaMalloc failed");
		goto Error;
	}
	status=cusolverDnDgetrf(handle,i_bus-1+npq,i_bus-1+npq,device_J,(i_bus-1+npq),device_work,device_Ipiv,device_info);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"Factorisation LU failed");
		goto Error;
	}
	cudaStatus=cudaMemcpy(&info, device_info,sizeof(int),cudaMemcpyDeviceToHost);
		if(cudaStatus!=cudaSuccess){
			fprintf(stderr,"memcpy fail");
			goto Error;
		}
	if(info!=0){
			fprintf(stderr,"Error:LU factorization failed\n");
		}
	status=cusolverDnDgetrs(handle,CUBLAS_OP_N ,i_bus-1+npq,1,device_J,(i_bus-1+npq),device_Ipiv,device_M,(i_bus-1+npq),device_info);
	if(status!=CUSOLVER_STATUS_SUCCESS){
		fprintf(stderr,"résolution failed");
		goto Error;
	}
	cudaStatus=cudaMemcpy(&info, device_info,sizeof(int),cudaMemcpyDeviceToHost);
		if(cudaStatus!=cudaSuccess){
			fprintf(stderr,"memcpy fail");
			goto Error;
		}
		if(info!=0){
			fprintf(stderr,"Error:solved fail\n");
		}
	int test;
	std::cin>>test;
	return 0;
Error:
	cudaFree(device_M);
	cudaFree(device_J);
	cudaFree(device_Ipiv);
	cudaFree(device_info);
}

Robert_Crovella · August 16, 2015, 11:09pm

I don’t see any errors compiling or running your code.

When I run it with cuda-memcheck:

$ nvcc -o t885 t885.cu -lcusolver
t885.cu(10): warning: variable "j_bus" was declared but never referenced

t885.cu(10): warning: variable "j_bus" was declared but never referenced

$ cuda-memcheck ./t885
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

(I removed the std::cin at the end.)

I get no errors. I am using linux, I suspect you are using windows.

What exactly is the problem? If the data is not as you expect, please have the code output the data that you think is a problem, and then identify what you think the correct data should be.

If the only concern is errors reported by the CUDA memory checker that is built into visual studio, you may need to spell out the exact platform you are using.

Are you now using CUDA 7.5RC ? If not, please update to that version. Also, if the only problem is the cuda memory checker built into visual studio, please re-run the command line version of the app, outside of visual studio, using cuda-memcheck, and report those results.

miyoku · August 17, 2015, 12:17pm

I update cuda to 7.5, and Nvsight don’t see any error now.
But yeah the data i’m expected for X is slightly different
Here is the data i have by using CUDA

0.28
0.08
0.04
0.12
0.12
0.16
0.16
0.16
0.24
0.16
0.16
0.12
0.2
0.16
0.16
-1.59748e-017
-8.70625e-018
-1.94097e-017
-3.62187e-017
-1.88343e-017
-3.0594e-017
-1.83001e-017
-7.47395e-018
-4.03762e-017
3.89248e-018
-4.45395e-017
-9.1973e-018
-4.46267e-017
-4.41709e-017

And here when i’m using a LU resolution with Eigen:

0.28
0.08
0.04
0.12
0.12
0.16
0.16
0.16
0.24
0.16
0.16
0.12
0.2
0.16
0.16
-1.03244e-012
-5.16219e-013
-1.54866e-012
-1.92119e-012
-1.90841e-012
-2.12448e-012
-1.69235e-012
-5.64115e-013
-2.12448e-012
-1.90841e-012
-1.98824e-012
-1.12823e-012
-2.12448e-012
-2.12448e-012

I don’t know why only the half of the vector is correct.

Robert_Crovella · August 17, 2015, 2:59pm

There may be two things going on.

cusolver expects data in column-major form. It’s not clear to me whether you’ve taken this into account, but when I transpose your A matrix, I get results that are closer to your eigen results in magnitude, but still not identical.
The reason for the disparity may also have to do with matrix conditioning. If I subtract two double quantities that are in the range of your first 15 outputs, that are approximately equal, I’m going to get a number that is approximately 1e-12, which is where your eigen results are at (and this seems to be possible with your given A matrix, although I haven’t worked through it). In this case, I wouldn’t be surprised if there are differences between a serial code and a parallel code at this range.

miyoku · August 18, 2015, 1:17pm

I’ve tried to transpose A to a column-major matrix. And i suppose that the difference between what i get now isn’t important. So cuSolver work well thank you for your help.

Topic		Replies	Views
cublasZgemm() gives false result for large data and potential bug GPU-Accelerated Libraries	6	1149	October 12, 2021
Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemsetAsync CUDA Programming and Performance	7	7691	January 11, 2020
cudaMemcpyAsync execution before and after Level 1 cuBLAS kernel calls nvc, nvc++ and nvfortran cuda	7	108	October 29, 2024
cusolverDnSgetrf() fails on A100 (but not on A10) when called in a tight loop GPU-Accelerated Libraries cusolver	15	1889	February 23, 2022
The incorrect results calculated by the function "cusolverDnXgetrs" when calling it again nvc, nvc++ and nvfortran	11	42	February 12, 2025
Problems of matrix multiplication With and without CUDA CUDA Programming and Performance	15	10007	January 18, 2012
The solution result of cuSolver is different from cv::solve Jetson Xavier NX cuda	2	705	November 4, 2021
cuSOLVER sparse: cusolverSpDcsrlsvqr() error GPU-Accelerated Libraries	4	3815	June 7, 2015
Strange "unspecified launch error" from a call to cublas gemm CUDA Programming and Performance	23	2711	January 19, 2019
Reading R8G8B8A8 texture using tex2D() causes strange result. CUDA Programming and Performance	27	2886	April 28, 2018

Solve AX=B With cuSolver

Related topics