I have an algorithm that works on a large matrix A (say of size m by n, with leading dimension lda). The algorithm boils down to 4 zgemm calls, each working on a different part of A. All of these cublaszgemm calls can run concurrently.
In Fortran-style programming, the whole matrix A is loaded, and the different parts of A are passed to the different zgemm calls using Fortran array indexing, e.g. A(1,20) refers to the memory location where the 20th column starts. In cublas, cublasalloc and cublassetmatrix work on a single device pointer pointing at the beginning of A, i.e. devptr_A.
1 - If I do one cublassetmatrix at the beginning to load the whole matrix A onto the device, how can I effectively pass different parts of A to the 4 different cublaszgemm calls? I tested, and it is clear that devptr_A+(lda*20) does NOT point at the address storing the (20*lda)th position of array A (in the Fortran sense, A(1,20)). Is there a way to do this?
2 - Alternatively, I can call cublasalloc at the beginning of the program to get a separate temporary array for each of the 4 cublaszgemm calls. Then, before each cublaszgemm call, I call cublassetmatrix to load just the part of A it needs. I launch all 4 cublaszgemm calls, then do 4 cublasgetmatrix calls to sync all of them. Would I take a performance hit?
3 - If I use method (2), would it be better to use cublasSetMatrixAsync and cublasGetMatrixAsync? If I do this, how can I sync all the cublasSetMatrixAsync calls before making the 4 cublaszgemm calls?