Will cuBLAS support arbitrary (row-major) pitched memory for the A, B, and C matrices in the future?

When I multiply packed matrices, cuBLAS works fine. But with pitched memory it produces a different result than the packed version. Changing the transpose parameters for A and B between CUBLAS_OP_T and CUBLAS_OP_N doesn't make the output match the packed version.

I tried every combination of CUBLAS_OP_T and CUBLAS_OP_N with lda = M, lda = pitch of A, ldb = M, ldb = pitch of B, and so on, but none of them produces the same result as a normal cuBLAS gemm call with CUBLAS_OP_N for both A and B on packed data. So I wrote a custom kernel that produces the same result as cuBLAS for both packed and pitched data. I multiply the y coordinate by the pitch when accessing memory, and that works for any pitch, including the packed case.

Maybe in the future, if cuBLAS accepted a user-supplied memory access/indexing functor (a lambda, for example), it would no longer be necessary to pack the data manually before calling gemm, which adds latency and reduces aggregate TFLOPS.

For example, I have an 8192 x 8192 matrix that contains the A matrix in its center, a 512 x 512 matrix as B, and an 8192 x 8192 matrix that contains C in its center. I use these sizes as lda, ldb, ldc and call cublasHgemm, and it produces a different result than the packed 512 x 512 version (same A, B, C).

For pointers, I compute base pointer + (top-left-X + top-left-Y x pitch) for each (sub)matrix.
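As a minimal sketch of that pointer arithmetic (the helper name `submatrix_ptr` is hypothetical, and the pitch is assumed to be counted in elements, not bytes):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: pointer to the top-left element of a submatrix that
// starts at column top_left_x, row top_left_y of a row-major parent matrix.
// pitch_elems is the distance between consecutive rows, in elements.
template <typename T>
T* submatrix_ptr(T* base, std::size_t top_left_x, std::size_t top_left_y,
                 std::size_t pitch_elems) {
    assert(top_left_x < pitch_elems);  // the column must lie inside one row
    return base + top_left_x + top_left_y * pitch_elems;
}
```

For a 512 x 512 block centered in an 8192 x 8192 parent, this would be called with top_left_x = top_left_y = 3840 and pitch_elems = 8192.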

The (base) CUBLAS API intends to mostly duplicate the behavior of the Netlib BLAS API. It has a column-major view of the world, and it is certainly capable of picking a submatrix out of a larger matrix.

It is not capable of handling arbitrary line pitch, especially if we assume pitch is in bytes. It can only extract a submatrix from a larger array of full elements, not bytes.
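To make the bytes-versus-elements point concrete, here is a small sketch that converts a byte pitch (such as the one cudaMallocPitch returns) into a leading dimension in elements. The helper name `pitch_bytes_to_ld` and the convention of returning 0 for an unusable pitch are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: convert a row pitch in bytes into a leading dimension
// in elements. Returns 0 when the pitch is not a whole number of elements,
// because such a layout cannot be described by an lda/ldb/ldc value.
inline std::size_t pitch_bytes_to_ld(std::size_t pitch_bytes,
                                     std::size_t elem_size) {
    assert(elem_size > 0);
    return (pitch_bytes % elem_size == 0) ? pitch_bytes / elem_size : 0;
}
```

With 2-byte half-precision elements, a 16384-byte pitch corresponds to a leading dimension of 8192, while a 513-byte pitch has no valid leading dimension.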

With respect to row-major usage, it's generally possible, using one of several methods, to do the desired operation on a row-major matrix. People often struggle with this. You can find various internet posts describing the approaches. Here are a few: 1 2

That second one includes an article by Peter Wittek (no longer available in its original form) that describes a systematic process to convert a row-major operation into an equivalent column-major realization.

Thank you, Robert. Those sources will be helpful. I just fill the matrices with random data for testing, so I have no preference for row- or column-major order. But the output of cuBLAS will be reused as input to other cuBLAS calls. I'm only trying to make it work exactly the same with any pitched (and mixed) A, B, C matrix layouts.

Peter Wittek's solution worked. I just swapped A and B and kept the non-transpose setting for both. Since no transpose is involved, performance is at its maximum too. I also used the pitch values for lda, ldb, ldc.
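Why the swap works can be sketched on the CPU. A row-major matrix with pitch p occupies the same memory as its transpose stored column-major with leading dimension p, and C^T = B^T * A^T, so a column-major gemm given B first and A second writes row-major C. The code below is not cuBLAS: `gemm_colmajor_ref` is a hypothetical stand-in that assumes BLAS-style column-major semantics without transposes, and `gemm_rowmajor_pitched` is an illustrative wrapper with pitches counted in elements.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a column-major, non-transposed gemm: C(m x n) = A(m x k) * B(k x n).
// Element (i, j) of a column-major matrix with leading dimension ld is at [i + j * ld].
void gemm_colmajor_ref(int m, int n, int k,
                       const float* A, int lda,
                       const float* B, int ldb,
                       float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = acc;
        }
    }
}

// Row-major C(M x N) = A(M x K) * B(K x N), each matrix with its own pitch.
// Operands are swapped (B first, A second), dimensions are passed as (N, M, K),
// and the pitches are used directly as leading dimensions.
void gemm_rowmajor_pitched(int M, int N, int K,
                           const float* A, int pitchA,
                           const float* B, int pitchB,
                           float* C, int pitchC) {
    gemm_colmajor_ref(N, M, K, B, pitchB, A, pitchA, C, pitchC);
}
```

Under the same assumptions, the corresponding cublasHgemm call would pass CUBLAS_OP_N for both operands, the dimensions in the order N, M, K, then B with ldb, A with lda, and C with ldc.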