Help with cublasDtrsm (or cublasStrsm) Incorrect output

I am facing problem calling the cublasDtrsm() function from cublas library. The output I am getting is incorrect. For example, I am trying to solve a matrix equation AX=alphaB. I am getting value of X to be the exact same as alpha*B. Could you please point out where I am going wrong? Is my call to cublasSDtrsm() correct?

I am attaching the sample output for matrices A, B and X below. The entire code is also attached.

Printing A: 

         1          0          0          0          0 

  0.783099          1          0          0          0 

  0.911647   0.364784          1          0          0 

  0.335223    0.95223   0.242887          1          0 

  0.277775   0.635712   0.804177   0.218257          1 

Printing B: 

  0.840188    0.55397   0.717297   0.156679   0.512932 

  0.394383   0.477397   0.141603   0.400944   0.839112 

   0.79844   0.628871   0.606969    0.12979    0.61264 

  0.197551   0.513401  0.0163006   0.108809   0.296032 

   0.76823   0.916195   0.137232   0.998925   0.637552 

Now calling cublasDtrsm

Printing X: 

  0.840188    0.55397   0.717297   0.156679   0.512932 

  0.394383   0.477397   0.141603   0.400944   0.839112 

   0.79844   0.628871   0.606969    0.12979    0.61264 

  0.197551   0.513401  0.0163006   0.108809   0.296032 

   0.76823   0.916195   0.137232   0.998925   0.637552

Here is the entire code:

#include <iostream>

#include <cstdlib>

#include <iomanip>

#include "cublas.h"

#define N (5)

#define IDX2C(i,j,N) (((j)*(N))+(i))

typedef unsigned int uint;

using namespace std;

void print_matrix(double *A, int num) {

	for(int i=0; i<num*num; i++) {

		cout<<setw(10)<<setprecision(6)<<A[i]<<" ";

		if((i+1)%N==0) cout<<endl;

	}

}

int main(int argc, char *argv[]) {

	cublasStatus status;

	double *h_A;

	double *h_B;

	double *d_A = 0;

	double *d_B = 0;

	double alpha = 1.0;

	int n2 = N*N;

	int i;

	status = cublasInit();

	if (status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! CUBLAS initialization error"<<endl;

		return EXIT_FAILURE;

	}

	h_A = new double[n2];

	if (h_A == 0) {

		cerr<<"!!!! host memory allocation error (A)"<<endl;

		return EXIT_FAILURE;

	}

	h_B = new double[n2];

	if (h_B == 0) {

		cerr<<"!!!! host memory allocation error (B)"<<endl;

		return EXIT_FAILURE;

	}

	/*

	h_C = new double[n2];

	if (h_C == 0) {

		cerr<<"!!!! host memory allocation error (B)"<<endl;

		return EXIT_FAILURE;

	}

	*/

	// Fill the matrices with test data

    for(int i=0; i<N; i++) {

		for(int j=0; j<N; j++) {

			h_B[IDX2C(i,j,N)] = rand() / (double)RAND_MAX;

            if(i<j) {

				h_A[IDX2C(i,j,N)] = rand() / (double)RAND_MAX;

			}

			else if (i>j) {

				h_A[IDX2C(i,j,N)] = 0.0;

			}

			else if (i==j) {

				h_A[IDX2C(i,j,N)] = 1.0;

			}

			else {

				cerr<<"!!!! Indexing error while filling matrix data"<<endl;

				return EXIT_FAILURE;

			}

		}

	}

status = cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! device memory allocation error (A)"<<endl;

		return EXIT_FAILURE;

	}

status = cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! device memory allocation error (B)"<<endl;

		return EXIT_FAILURE;

	}

	status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! device access error (write A)"<<endl;

		return EXIT_FAILURE;

	}

	status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! device access error (write B)"<<endl;

		return EXIT_FAILURE;

	}

cout<<"Printing A: "<<endl;

	print_matrix(h_A,N);

	cout<<"Printing B: "<<endl;

	print_matrix(h_B,N);

	cout<<"Now calling cublasDtrsm"<<endl;

	//Clear last error

	cublasGetError();

	cublasDtrsm('l', 'l', 'n', 'n', N, N, alpha, d_A, N, d_B, N);

	status = cublasGetError();

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! kernel execution error"<<endl;

		return EXIT_FAILURE;

	}

	double *h_C = new double[n2];

	status = cublasGetVector(n2, sizeof(h_B[0]), d_B, 1, h_C, 1);

	if(status != CUBLAS_STATUS_SUCCESS) {

		cerr<<"!!!! devide access error (read B)"<<endl;

		return EXIT_FAILURE;

	}

	cout<<"Printing X: "<<endl;

	print_matrix(h_C,N);

	return 0;

}

I did look at all the support code, so there might be other errors, but your left hand side matrix is unit triangular. You should be passing ‘U’ as the fourth argument to the blas function.

Hey, thanks for the quick reply. I was just testing whether it makes any difference. Passing ‘u’ as the fourth argument makes no difference in output.

I am at wit’s end. The standard sequential code for solving the system A*X=B is as follows:

for(uint j=0; j<N; j=j+1) {

	for(uint k=0; k<N; k=k+1) {

		if(B[j*N+k]!=0) {

			for(uint i=k+1; i<N; i++) {

				B[j*N+i] -= A[k*N+i]*B[j*N+k];

			}

		}

	}	

}

I wrote the sequential code and it turns out that the output is the same as cublas output! Clearly I’m doing something wrong in BOTH.

The way I am representing the matrices A and B in memory is in row-major order. Is that okay?

CUBLAS is column major order, so not it isn’t okay. Either change to column major, or change the second argument to ‘U’ for upper. Then it should work.

Yeah, I think I got it. The cublas call as well as the above sequential code snippet assumes that the matrices are laid out in column major. Both work fine now.

Thanks a bunch!