CUBLAS operating on different parts of an array

I have an algorithm that works on a large matrix A (say of size (m, n) with leading dimension lda). The algorithm boils down to 4 zgemm calls, each working on a different part of A. All of these cublasZgemm calls could be done concurrently.

In Fortran-style programming, the whole matrix A is loaded and different indices pointing at different parts of A are passed to the different zgemm calls using Fortran array indexing, e.g. A(1,20) points at the memory location of the 20th column. In CUBLAS, cublasAlloc and cublasSetMatrix work with a single device pointer pointing at the beginning of A, i.e. devptr_A.
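For context, here is roughly how I set things up at the moment (a minimal sketch; the sizes m, n, lda are just example values and error checking is omitted):

```c
/* Sketch: load the whole matrix A onto the device once. */
#include <stdlib.h>
#include <cublas.h>
#include <cuComplex.h>

int main(void)
{
    int m = 1000, n = 800, lda = 1000;   /* example sizes */
    cuDoubleComplex *A = malloc((size_t)lda * n * sizeof(*A));  /* host A, column-major */
    cuDoubleComplex *devptr_A;

    cublasInit();
    cublasAlloc(lda * n, sizeof(cuDoubleComplex), (void **)&devptr_A);

    /* copy the full m-by-n matrix; leading dimension lda on host and device */
    cublasSetMatrix(m, n, sizeof(cuDoubleComplex), A, lda, devptr_A, lda);

    /* ... the four zgemm calls would go here ... */

    cublasFree(devptr_A);
    cublasShutdown();
    free(A);
    return 0;
}
```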

Questions:

1 - If I do one cublasSetMatrix call at the beginning to load the whole matrix A onto the device, how can I effectively pass different parts of A to the 4 different cublasZgemm calls? I tested this, and it seems that devptr_A + lda*20 does NOT point at the address storing the (20*lda)-th element of array A (in the Fortran sense, A(1,20)). Is there a way to do this?

2 - Alternatively, I can do cublasAlloc at the beginning of the program to get a set of temporary arrays, one for each of the 4 cublasZgemm calls. Then, for each cublasZgemm call, I call cublasSetMatrix to load just the part of A that it needs before calling cublasZgemm. I launch all 4 cublasZgemm calls, then do 4 cublasGetMatrix calls to sync all of them. Would I take a performance loss?

3 - If I use method (2), would it be better to use cublasSetMatrixAsync and cublasGetMatrixAsync? If I do this, how can I sync all the cublasSetMatrixAsync calls before making the 4 cublasZgemm calls? (A rough sketch of what I have in mind follows below.)
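To make (2)/(3) concrete, this is roughly what I imagine for the staging step, though I have not tested it and the per-copy streams are just my guess at how the synchronization would work (hostA, devA, rows, cols are placeholder names for the four sub-blocks and their temporary device arrays):

```c
#include <cublas.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

/* Copy four sub-blocks of A into four temporary device arrays,
 * one stream per copy, then wait for all copies to finish.
 * Note: for the copies to really overlap, the host blocks would
 * need to be in page-locked (pinned) memory. */
void stage_submatrices(cuDoubleComplex *hostA[4], cuDoubleComplex *devA[4],
                       int rows[4], int cols[4], int lda)
{
    cudaStream_t stream[4];
    int i;

    for (i = 0; i < 4; ++i) {
        cudaStreamCreate(&stream[i]);
        /* copy the i-th sub-block into its own packed temporary array */
        cublasSetMatrixAsync(rows[i], cols[i], sizeof(cuDoubleComplex),
                             hostA[i], lda, devA[i], rows[i], stream[i]);
    }

    /* make sure every copy has finished before any cublasZgemm runs */
    for (i = 0; i < 4; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```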

Any suggestions will be greatly appreciated.

JL

Actually, CUBLAS also uses the column-major storage format. So devptr_A + lda*20 will point at the (20*lda)-th element of array A in GPU memory.

So you should do only one cublasAlloc (or cudaMalloc) and call cublasZgemm with different device pointers.

We developed CUBLAS specifically to support addressing a sub-matrix within a bigger one.
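For example, something along these lines (just a sketch; the block sizes mb, nb, kb, the 'N' transpose flags, and the B/C operands are placeholders for whatever your four products actually are):

```c
#include <cublas.h>
#include <cuComplex.h>

/* devptr_A holds the full matrix A (column-major, leading dimension lda)
 * uploaded once with cublasSetMatrix.  Fortran element A(i,j) lives at
 * devptr_A + (i-1) + (j-1)*lda, so sub-blocks are just pointer offsets. */
void two_of_the_gemms(const cuDoubleComplex *devptr_A, int lda,
                      const cuDoubleComplex *devB, int ldb,
                      cuDoubleComplex *devC, int ldc,
                      int mb, int nb, int kb)
{
    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    /* block starting at A(1,1) */
    cublasZgemm('N', 'N', mb, nb, kb, one,
                devptr_A, lda,
                devB, ldb, zero,
                devC, ldc);

    /* block starting at A(1,20): column 20 begins at offset 19*lda */
    cublasZgemm('N', 'N', mb, nb, kb, one,
                devptr_A + 19 * lda, lda,
                devB, ldb, zero,
                devC + (size_t)nb * ldc, ldc);   /* second result block, also just an offset */
}
```

In other words, once the whole matrix is on the device, the sub-blocks are addressed purely through pointer offsets; no temporary arrays or extra copies are needed.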
