Say matrix A1 is m=3 (rows) x k=3 (columns), and just considering the real part:
1 4 7
2 5 8
3 6 9
Then, with column-major format, A1 is stored as an array [1 2 3 4 5 6 7 8 9].
I am interested in the cases like A2 =
1 2 3
2 3 4
3 4 5
That is, where each column is offset by less than the number of rows. Here A2 is stored as an array [1 2 3 4 5]. In terms of the GEMM, we have m=3 and lda=1, where lda is the “leading dimension”, which determines how far we jump through the array for each column (link).
When lda<m, the total number of elements in the array is N = m + (k-1)*lda. For our example, N = 3 + 2*1 = 5.
So, in @mnicely’s code, I simply change line 339 to:
size_t sizeA = (k - 1) * lda + m; // m * k;
With lda = m (which is the original setting, on line 336), everything is fine (because sizeA = m*k in this case). But when I set lda = 8 (the case I am most interested in), I get the following:
CUDA error at …/…/…/cublasLt_C16F_TCs.cu:279 code=7(CUBLAS_STATUS_INVALID_VALUE) “cublasLtMatmul( ltHandle, operationDesc, alpha, Atransform, AtransformDesc, Btransform, BtransformDesc, beta, Ctransform, CtransformDesc, Ctransform, CtransformDesc, nullptr, workSpace, workSpaceSize, stream )”
Line 279 is the cublasLtMatmul call … but I am having no luck tracking down the source of the error. It seems like the new sizeA should handle things.
The documentation for cublasLtMatrixLayoutCreate says that the leading dimension must be >= m … but I think this may be an error (lda<m certainly works with cutlass::gemm::device::GemmBatched).
Appreciate any suggestions.