Manipulating a single element of a CUBLAS matrix

I am in the process of learning CUBLAS. I have set up a matrix A and tried to assign the value 100 to element A(2,1). This is the code I have written:

[codebox]cublasInit();
cublasAlloc(m*n,sizeof(float),(void**)&gA);
cudaMemset(gA,0,m*n*4);
cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
gA[1]=100;
cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
cublasFree(gA);
}[/codebox]

The code compiles, but I keep getting a “segmentation violation” error.

I have no problem when I use cublasSscal with alpha=3 (or another constant):

[codebox]#include "mex.h"
#include "cublas.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m,n;
    int dims0[2];
    float *A,*gA,*ans;

    m=mxGetM(prhs[0]);
    n=mxGetN(prhs[0]);
    dims0[0]=m;
    dims0[1]=n;
    plhs[0]=mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
    ans=(float*)mxGetData(plhs[0]);
    A=(float*)mxGetData(prhs[0]);

    cublasInit();
    cublasAlloc(m*n,sizeof(float),(void**)&gA);
    cudaMemset(gA,0,m*n*4);
    cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
    (void)cublasSscal(m-1,3,gA,1);
    cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
    cublasFree(gA);
}[/codebox]

But I start to get problems when I use an element of gA instead of a constant for alpha (e.g. gA[1]):

[codebox]#include "mex.h"
#include "cublas.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m,n;
    int dims0[2];
    float *A,*gA,*ans;

    m=mxGetM(prhs[0]);
    n=mxGetN(prhs[0]);
    dims0[0]=m;
    dims0[1]=n;
    plhs[0]=mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
    ans=(float*)mxGetData(plhs[0]);
    A=(float*)mxGetData(prhs[0]);

    cublasInit();
    cublasAlloc(m*n,sizeof(float),(void**)&gA);
    cudaMemset(gA,0,m*n*4);
    cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
    (void)cublasSscal(m-1,gA[1],gA,1);
    cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
    cublasFree(gA);
}[/codebox]

You cannot dereference gA on the host - it is a device pointer, not a host pointer.
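To set that single element without dereferencing the device pointer, you can copy just that one value across with cudaMemcpy. A minimal sketch, assuming the column-major storage CUBLAS uses, so A(2,1) sits at zero-based offset (2-1) + (1-1)*m = 1:

[codebox]// Sketch: write 100 into A(2,1) of the device matrix.
float val = 100.0f;
cudaMemcpy(gA + 1, &val, sizeof(float), cudaMemcpyHostToDevice);

// Reading a single element back to the host works the same way in reverse:
float out;
cudaMemcpy(&out, gA + 1, sizeof(float), cudaMemcpyDeviceToHost);[/codebox]

cublasSetVector(1, sizeof(float), &val, 1, gA + 1, 1) would do the same job for the write.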

Thanks for your prompt reply. Now I realize the error in the code in my first post. But what about the second one? Isn’t (void)cublasSscal(m-1,gA[1],gA,1) performed on the device? What changes should I make to the code?

No, cublasSscal is a host function - it launches operations on the device, but the function itself runs on the host. You are still trying to dereference a device pointer, and the underlying error is the same.
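For the scal case, one option (just a sketch, not the only way) is to pull the scale factor back to the host first and pass it by value, since the legacy cublasSscal takes alpha as a plain float:

[codebox]// Sketch: copy the intended alpha from device memory to a host variable,
// then hand it to cublasSscal by value.
float alpha;
cudaMemcpy(&alpha, gA + 1, sizeof(float), cudaMemcpyDeviceToHost);
cublasSscal(m-1, alpha, gA, 1);[/codebox]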

Thanks again for your reply. So does that mean that if I want to perform the above operation, I shouldn’t rely on the CUBLAS library (since its functions are host functions), but should write my own kernel instead?

In the above code, isn’t gA just a copy of A prior to the sscal() call? So wouldn’t it be possible to use the corresponding value of A rather than gA?
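In other words, for the code above, something along these lines should give the same result (a sketch using your own variables):

[codebox]// gA was just filled from A by cublasSetMatrix, so the host-side copy
// still holds the value you want to use as alpha.
(void)cublasSscal(m-1, A[1], gA, 1);[/codebox]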

Yes, for this case it will work. But what about cases like Gaussian elimination and LU factorization, where I have to make use of elements of gA? How should CUBLAS be used in those cases?

That is really a question only you can answer. If you need intermediate results from the GPU in your host code, you have to copy them back from device memory to host memory (which greatly limits throughput because of PCI-e bus bandwidth and latency). So the algorithm design goal becomes keeping as much data as possible in the GPU for as long as possible and minimizing intermediate data exchange between the host and GPU.
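To make that concrete, here is a rough sketch (untested, written against the legacy CUBLAS API used above) of one elimination step of an unblocked LU with partial pivoting, where the single pivot value is the only thing copied back to the host per step. The names k, p, col_k and pivot are illustrative, the matrix is assumed square (m x m, column-major, lda = m), and error checking is omitted:

[codebox]// One elimination step k (0-based) of an unblocked right-looking LU
// with partial pivoting on the device matrix gA.
float *col_k = gA + k*m + k;                       // A(k,k) downwards
int p = k + cublasIsamax(m - k, col_k, 1) - 1;     // Isamax returns a 1-based index
if (p != k)
    cublasSswap(m, gA + k, m, gA + p, m);          // swap rows k and p

float pivot;                                       // the only host<->device traffic
cudaMemcpy(&pivot, gA + k*m + k, sizeof(float), cudaMemcpyDeviceToHost);

cublasSscal(m - k - 1, 1.0f/pivot, gA + k*m + k + 1, 1);   // form L(:,k)
cublasSger(m - k - 1, m - k - 1, -1.0f,
           gA + k*m + k + 1, 1,                    // column below the pivot
           gA + (k+1)*m + k, m,                    // row to the right of it
           gA + (k+1)*m + k + 1, m);               // trailing submatrix update[/codebox]

Every call here is Level 1 or Level 2 BLAS, so the GPU never gets much work per step - which is exactly what the blocked approach below improves on.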

For the example of an LU factorization, it should be obvious that a block factorization algorithm that uses SGEMM and STRSM (i.e. Level 3 BLAS) is probably going to perform better than one which uses a lot of Level 1 BLAS functions with frequent data exchange between host and device. On that subject, Vasily Volkov from UC Berkeley has written a highly optimized block LU factorization that overlaps host and device BLAS and delivers very impressive performance. There are code and papers linked in this post.