When I multiply packed matrices, cuBLAS works fine. But with pitched memory it produces a different result from the packed version. Switching the transpose parameters for A and B between CUBLAS_OP_T and CUBLAS_OP_N doesn't make the outputs match the packed version.
I tried all combinations of CUBLAS_OP_T, CUBLAS_OP_N, lda = M, lda = pitch of A, ldb = M, ldb = pitch of B, …, but none of them produces the same result as a normal cuBLAS GEMM call with CUBLAS_OP_N for both A and B on packed data. So I wrote a custom kernel that produces the same result as cuBLAS for both packed and pitched data. It multiplies the y coordinate by the pitch when accessing memory, so it works for any pitch, including the packed case.
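The indexing idea behind such a kernel can be sketched on the CPU in plain C++. This is not the actual kernel, which isn't shown here. The function name `gemm_pitched` is hypothetical, and the sketch assumes row-major storage with each pitch given in elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the author's kernel.
// Assumptions: row-major storage; each pitch is the distance between the
// starts of two consecutive rows, measured in elements (not bytes).
// Computes C (M x N) = A (M x K) * B (K x N).
void gemm_pitched(const float* A, std::size_t pitchA,
                  const float* B, std::size_t pitchB,
                  float* C, std::size_t pitchC,
                  std::size_t M, std::size_t N, std::size_t K) {
    // A packed matrix is just the special case pitch == row width.
    assert(pitchA >= K && pitchB >= N && pitchC >= N);
    for (std::size_t y = 0; y < M; ++y) {
        for (std::size_t x = 0; x < N; ++x) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // The row coordinate is always multiplied by the pitch.
                acc += A[y * pitchA + k] * B[k * pitchB + x];
            }
            C[y * pitchC + x] = acc;
        }
    }
}
```

Calling it with pitchA = K, pitchB = N, pitchC = N gives the packed behaviour; any larger pitch skips the padding at the end of each row.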
Maybe in the future, if cuBLAS accepted a user-supplied memory access/indexing functor (for example, a lambda), users could avoid manually packing the data before calling GEMM (which adds latency and reduces aggregate TFLOPS).
For example, I have an 8192 x 8192 matrix that contains A in its center, a 512 x 512 matrix as B, and an 8192 x 8192 matrix that contains C in its center. I pass these sizes as lda, ldb, and ldc to cublasHgemm, and it produces a different result from a packed 512 x 512 version (same A, B, C).
For pointers, I compute base pointer + (top-left-X + top-left-Y x pitch) for each (sub)matrix.
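One thing worth checking here, as an assumption since the post doesn't state the storage order: cuBLAS follows the column-major (Fortran) convention and takes leading dimensions in elements, whereas cudaMallocPitch reports the pitch in bytes. If the pitched buffers are row-major, a row-major product C = A * B can be obtained from a column-major routine by computing C^T = B^T * A^T, that is, by swapping the operands and the m/n sizes and passing each row pitch (in elements) as the leading dimension. The sketch below illustrates this on the CPU with a hypothetical stand-in, `gemm_colmajor`, instead of a real cuBLAS call; `submatrix_offset` and `gemm_rowmajor_via_colmajor` are likewise made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element offset of a sub-matrix whose top-left corner
// is at column x, row y of a parent buffer. pitch is in elements.
std::size_t submatrix_offset(std::size_t x, std::size_t y, std::size_t pitch) {
    return x + y * pitch;
}

// Hypothetical CPU stand-in with BLAS-like column-major semantics and no
// transposes: C (m x n) = A (m x k) * B (k x n), where element (i, j) of a
// matrix lives at ptr[i + j * ld].
void gemm_colmajor(std::size_t m, std::size_t n, std::size_t k,
                   const float* A, std::size_t lda,
                   const float* B, std::size_t ldb,
                   float* C, std::size_t ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = acc;
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N) over pitched views, expressed
// through the column-major routine as C^T = B^T * A^T: operands swapped,
// M and N swapped, and each row pitch used as the leading dimension.
void gemm_rowmajor_via_colmajor(std::size_t M, std::size_t N, std::size_t K,
                                const float* A, std::size_t pitchA,
                                const float* B, std::size_t pitchB,
                                float* C, std::size_t pitchC) {
    gemm_colmajor(N, M, K, B, pitchB, A, pitchA, C, pitchC);
}
```

For a 512 x 512 block centered in an 8192 x 8192 parent, the top-left corner is at (3840, 3840), so the element offset is 3840 + 3840 x 8192 = 31461120. With half-precision data, a byte pitch would need to be divided by the element size (2 bytes) before being used as a leading dimension.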