I need to run multiple GEMMs inside a single CUDA kernel, where each A[i], B[i], C[i] matrix has its own pitch and size. I tried the cuBLAS batched GEMM function, but it doesn’t support lda, ldb, ldc that vary across the batch, so I had to launch multiple cuBLAS batched GEMMs, which reduced performance. cuBLASLt doesn’t support this either. More importantly, sometimes the number of unique pitch values equals the batch size, so the work degenerates into sequential GEMM launches, or into many streams that have to be synchronized, with extra overhead.
Can cuBLASDx work with lda, ldb, ldc and matrix sizes that vary across the batch in a custom CUDA kernel?
For example:
batch size = 7
gemm 1: lda = 1024, ldb = 1024, ldc = 8192, square matrices size = 512 for A, B, C
gemm 2: lda = 2048, ldb = 1024, ldc = 1024, square matrices size = 512 for A, B, C
...
gemm 7: lda = ldb = ldc = size = 512
lda, ldb, ldc are known only at runtime.
I only need a simple per-block adjustment when accessing global matrix data in each CUDA block:
auto index = x + y * pitch; // or x + y * pitch()
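To make the indexing above concrete, here is a minimal host-side C++ sketch of what I mean by "one pitch per problem". It is only a reference, not cuBLASDx code: `GemmDesc`, `at` and `grouped_gemm_reference` are hypothetical names I made up, the matrices are assumed column-major and square, and pitches are in elements. In the real kernel the outer loop over problems would presumably become one CUDA block (or group of blocks) per descriptor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-problem descriptor: one entry per GEMM in the batch.
struct GemmDesc {
    const float* A;
    const float* B;
    float*       C;
    int n;              // square matrix size
    int lda, ldb, ldc;  // runtime pitches, in elements
};

// Same form as "x + y * pitch": x is the fast (contiguous) coordinate.
inline std::size_t at(int x, int y, int pitch) {
    return static_cast<std::size_t>(x) + static_cast<std::size_t>(y) * pitch;
}

// Host reference for C = A * B on every problem, each with its own pitches.
// Assumption: in a kernel, this outer loop maps to the block index.
void grouped_gemm_reference(const std::vector<GemmDesc>& batch) {
    for (const GemmDesc& g : batch) {
        for (int col = 0; col < g.n; ++col) {
            for (int row = 0; row < g.n; ++row) {
                float acc = 0.0f;
                for (int k = 0; k < g.n; ++k) {
                    acc += g.A[at(row, k, g.lda)] * g.B[at(k, col, g.ldb)];
                }
                g.C[at(row, col, g.ldc)] = acc;
            }
        }
    }
}
```

A reference like this is also handy for checking the output of the fused kernel against the multi-launch version.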
Tensor-map descriptors can also take a runtime pitch value in bytes for TMA load/store.
If cuBLASDx can’t index global data with a variable pitch, can I load/store the data manually without cuBLASDx and use cuBLASDx only for the multiply-add part?
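The split I have in mind looks roughly like the host-side C++ sketch below: the pitched loads and stores are hand-written and take the runtime pitch, while the multiply-add only ever sees dense, fixed-size tiles. None of this is cuBLASDx API. `TILE`, `pitched_index`, `load_tile`, `tile_mma`, `store_tile` and `gemm_staged` are hypothetical names, the dense tile buffers stand in for shared memory, and `tile_mma` is a plain-loop stand-in for whatever the library would provide. Column-major storage and pitches in elements are assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int TILE = 4;  // hypothetical compile-time tile edge

inline std::size_t pitched_index(int x, int y, int pitch) {
    return static_cast<std::size_t>(x) + static_cast<std::size_t>(y) * pitch;
}

// Manual load: pitched source -> dense TILE x TILE buffer, zero-filled past n.
void load_tile(const float* src, int pitch, int x0, int y0, int n, float* tile) {
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x) {
            const bool inside = (x0 + x < n) && (y0 + y < n);
            tile[x + y * TILE] = inside ? src[pitched_index(x0 + x, y0 + y, pitch)] : 0.0f;
        }
}

// Stand-in for the library multiply-add: acc += a * b on dense tiles only.
void tile_mma(const float* a, const float* b, float* acc) {
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x)
            for (int k = 0; k < TILE; ++k)
                acc[x + y * TILE] += a[x + k * TILE] * b[k + y * TILE];
}

// Manual store: dense accumulator tile -> pitched destination, clipped to n.
void store_tile(const float* tile, float* dst, int pitch, int x0, int y0, int n) {
    for (int y = 0; y < TILE && y0 + y < n; ++y)
        for (int x = 0; x < TILE && x0 + x < n; ++x)
            dst[pitched_index(x0 + x, y0 + y, pitch)] = tile[x + y * TILE];
}

// One problem: C = A * B with runtime n, lda, ldb, ldc.
void gemm_staged(const float* A, int lda, const float* B, int ldb,
                 float* C, int ldc, int n) {
    std::vector<float> a(TILE * TILE), b(TILE * TILE), acc(TILE * TILE);
    for (int y0 = 0; y0 < n; y0 += TILE)
        for (int x0 = 0; x0 < n; x0 += TILE) {
            std::fill(acc.begin(), acc.end(), 0.0f);
            for (int k0 = 0; k0 < n; k0 += TILE) {
                load_tile(A, lda, x0, k0, n, a.data());
                load_tile(B, ldb, k0, y0, n, b.data());
                tile_mma(a.data(), b.data(), acc.data());
            }
            store_tile(acc.data(), C, ldc, x0, y0, n);
        }
}
```

Only `load_tile` and `store_tile` know about the pitch, so in this layout the tile multiply-add would not need to care that lda, ldb, ldc differ per problem.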