Manipulating a single element of a CUBLAS matrix

I am in the process of learning CUBLAS. I have set up a matrix A and tried to assign the value 100 to element A(2,1). This is the code I have written:

[codebox]cublasInit();
cublasAlloc(m*n,sizeof(float),(void**)&gA);
cudaMemset(gA,0,m*n*4);
cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
gA[1]=100;
cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
cublasFree(gA);
}[/codebox]

The code compiles, but I keep getting a “segmentation violation” error.

I have no problem when I use cublasSscal with alpha=3 (or another constant):

[codebox]#include "mex.h"
#include "cublas.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m,n;
    int dims0[2];
    float *A,*gA,*ans;

    m=mxGetM(prhs[0]);
    n=mxGetN(prhs[0]);
    dims0[0]=m;
    dims0[1]=n;
    plhs[0]=mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
    ans=(float*)mxGetData(plhs[0]);
    A=(float*)mxGetData(prhs[0]);

    cublasInit();
    cublasAlloc(m*n,sizeof(float),(void**)&gA);
    cudaMemset(gA,0,m*n*4);
    cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
    (void)cublasSscal(m-1,3,gA,1);
    cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
    cublasFree(gA);
}[/codebox]

But I start to get problems when I use an element of gA instead of a constant for alpha (e.g. gA[1]):

[codebox]#include "mex.h"
#include "cublas.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m,n;
    int dims0[2];
    float *A,*gA,*ans;

    m=mxGetM(prhs[0]);
    n=mxGetN(prhs[0]);
    dims0[0]=m;
    dims0[1]=n;
    plhs[0]=mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
    ans=(float*)mxGetData(plhs[0]);
    A=(float*)mxGetData(prhs[0]);

    cublasInit();
    cublasAlloc(m*n,sizeof(float),(void**)&gA);
    cudaMemset(gA,0,m*n*4);
    cublasSetMatrix(m,n,sizeof(float),A,m,(void*)gA,m);
    (void)cublasSscal(m-1,gA[1],gA,1);
    cublasGetMatrix(m,n,sizeof(float),gA,m,ans,m);
    cublasFree(gA);
}[/codebox]

You cannot dereference gA on the host - it is a device pointer, not a host pointer.
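To set that single element without dereferencing the device pointer, you can copy just that one value across with cudaMemcpy. A minimal sketch, assuming the column-major storage CUBLAS uses, so A(2,1) sits at zero-based offset (2-1) + (1-1)*m = 1:

[codebox]// Sketch: write 100 into A(2,1) of the device matrix.
float val = 100.0f;
cudaMemcpy(gA + 1, &val, sizeof(float), cudaMemcpyHostToDevice);

// Reading a single element back to the host works the same way in reverse:
float out;
cudaMemcpy(&out, gA + 1, sizeof(float), cudaMemcpyDeviceToHost);[/codebox]

cublasSetVector(1, sizeof(float), &val, 1, gA + 1, 1) would do the same job for the write.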

Thanks for your prompt reply. Now I realize the error in the code in my first post. But what about the second one? Isn’t (void)cublasSscal(m-1,gA[1],gA,1) performed on the device? What changes should I make to the code?

No, cublasSscal is a host function - it launches operations on the device, but the function itself runs on the host. You are still trying to dereference a device pointer, and the underlying error is the same.
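For the scal case, one option (just a sketch, not the only way) is to pull the scale factor back to the host first and pass it by value, since the legacy cublasSscal takes alpha as a plain float:

[codebox]// Sketch: copy the intended alpha from device memory to a host variable,
// then hand it to cublasSscal by value.
float alpha;
cudaMemcpy(&alpha, gA + 1, sizeof(float), cudaMemcpyDeviceToHost);
cublasSscal(m-1, alpha, gA, 1);[/codebox]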

Thanks again for your reply. So does that mean that if I want to perform the above operation, I shouldn’t rely on the CUBLAS library (since its functions are host functions), but should write my own kernel instead?

In the above code, isn’t gA just a copy of A prior to the sscal() call? So wouldn’t it be possible to use the corresponding value of A rather than gA?
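In other words, for the code above, something along these lines should give the same result (a sketch using your own variables):

[codebox]// gA was just filled from A by cublasSetMatrix, so the host-side copy
// still holds the value you want to use as alpha.
(void)cublasSscal(m-1, A[1], gA, 1);[/codebox]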

Yes, for this case it will work. But what about cases like Gaussian elimination and LU factorization, where I have to make use of elements of gA? How should CUBLAS be used in those cases?

That is really a question only you can answer. If you need intermediate results from the GPU in your host code, you have to copy them back from device memory to host memory (which greatly limits throughput because of PCI-e bus bandwidth and latency). So the algorithm design goal becomes keeping as much data as possible in the GPU for as long as possible and minimizing intermediate data exchange between the host and GPU.
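To make that concrete, here is a rough sketch (untested, written against the legacy CUBLAS API used above) of one elimination step of an unblocked LU with partial pivoting, where the single pivot value is the only thing copied back to the host per step. The names k, p, col_k and pivot are illustrative, the matrix is assumed square (m x m, column-major, lda = m), and error checking is omitted:

[codebox]// One elimination step k (0-based) of an unblocked right-looking LU
// with partial pivoting on the device matrix gA.
float *col_k = gA + k*m + k;                       // A(k,k) downwards
int p = k + cublasIsamax(m - k, col_k, 1) - 1;     // Isamax returns a 1-based index
if (p != k)
    cublasSswap(m, gA + k, m, gA + p, m);          // swap rows k and p

float pivot;                                       // the only host<->device traffic
cudaMemcpy(&pivot, gA + k*m + k, sizeof(float), cudaMemcpyDeviceToHost);

cublasSscal(m - k - 1, 1.0f/pivot, gA + k*m + k + 1, 1);   // form L(:,k)
cublasSger(m - k - 1, m - k - 1, -1.0f,
           gA + k*m + k + 1, 1,                    // column below the pivot
           gA + (k+1)*m + k, m,                    // row to the right of it
           gA + (k+1)*m + k + 1, m);               // trailing submatrix update[/codebox]

Every call here is Level 1 or Level 2 BLAS, so the GPU never gets much work per step - which is exactly what the blocked approach below improves on.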

For the example of an LU factorization, it should be obvious that a block factorization algorithm that uses SGEMM and STRSM (i.e. Level 3 BLAS) is probably going to perform better than one which uses a lot of Level 1 BLAS functions with frequent data exchange between host and device. On that subject, Vasily Volkov from UC Berkeley has written a highly optimized block LU factorization that overlaps host and device BLAS and delivers very impressive performance. There are code and papers linked in this post.