Is cuBLASDx compatible with per-block global pitch or stride values in a batched GEMM kernel?

I need to run multiple GEMM computations inside a single CUDA kernel, where each A[i], B[i], C[i] matrix has its own pitch and size. I tried the cuBLAS batched GEMM function, but it doesn’t support lda, ldb, ldc that vary across the batch, so I had to launch multiple cuBLAS batched GEMMs, which reduced performance. cuBLASLt doesn’t support this either. More importantly, the number of unique pitch values is sometimes equal to the batch size, so it degenerates into sequential GEMM computation, or into many streams that have to be synchronized with extra overhead.

Can cuBLASDx work with lda, ldb, ldc and matrix sizes that vary across the batch in a custom CUDA kernel?

For example:

batch size = 7
gemm 1: lda = 1024, ldb = 1024, ldc = 8192, square matrices size = 512 for A, B, C
gemm 2: lda = 2048, ldb = 1024, ldc = 1024, square matrices size = 512 for A, B, C
...
gemm 7: lda = ldb = ldc = size = 512

lda, ldb, ldc = runtime-known variables

I only need a simple fix per cuda block when accessing global matrix data:

auto index = x + y * pitch; // or x + y * pitch()
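To make the per-block indexing concrete, here is a minimal host-side C++ sketch of what I mean. The `GemmDesc` struct, `pitched_index` and `load_a` are hypothetical names for illustration only (they are not part of cuBLASDx), and the `block` argument stands in for `blockIdx.x` in a real kernel. Pitches are assumed to be in elements, not bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-problem descriptor: one entry per GEMM in the batch.
struct GemmDesc {
    const float* a;
    const float* b;
    float*       c;
    int size;           // square matrix dimension, in elements
    int lda, ldb, ldc;  // runtime pitches, in elements (not bytes)
};

// x is the contiguous coordinate, y the strided one: index = x + y * pitch.
// The arithmetic is done in size_t so large pitches cannot overflow int.
inline std::size_t pitched_index(int x, int y, int pitch) {
    assert(x >= 0 && y >= 0 && x < pitch);
    return static_cast<std::size_t>(x) +
           static_cast<std::size_t>(y) * static_cast<std::size_t>(pitch);
}

// Stand-in for what one CUDA block would do: `block` plays the role of
// blockIdx.x and selects the descriptor, and therefore the pitch, at runtime.
inline float load_a(const std::vector<GemmDesc>& batch, int block, int x, int y) {
    const GemmDesc& d = batch[static_cast<std::size_t>(block)];
    assert(x < d.size && y < d.size);
    return d.a[pitched_index(x, y, d.lda)];
}
```

In a kernel the descriptor array would sit in global or constant memory, so nothing about lda, ldb, ldc has to be known at compile time.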

Tensor-map descriptors can also take runtime pitch value in bytes for TMA load/store.

If cuBLASDx can’t index global data with a variable pitch, can I load/store the data manually without cuBLASDx and use cuBLASDx only for the multiply-add part?

I tried compile-time dispatch for the runtime-known lda, ldb, ldc values within the kernel (for a batch of GEMMs with various lda, ldb values). This used 54 GB of RAM during compilation and then exited with this error:

./mathdx/nvidia/mathdx/25.12/external/cutlass/include/cute/underscore.hpp(105): catastrophic error: error while writing generated C file: No space left on device

1 catastrophic error detected in the compilation of "main.cu".
Compilation terminated.
make: *** [makefile:18: main] Error 1

after multiple warnings with increasing bit size value like this:

./mathdx/nvidia/mathdx/25.12/external/cutlass/include/cute/algorithm/cooperative_copy.hpp(224): warning #61-D: integer operation result is out of range
      constexpr int total_bits = total_elem * elem_bits;
                                            ^
          detected during:
            instantiation of "void cute::cooperative_copy<... SrcLayout=cute::Layout<cute::tuple<cute::C<16384>, cute::C<16384>>, ....

                                            ^
          detected during:
            instantiation of "void cute::cooperative_copy<... SrcLayout=cute::Layout<cute::tuple<cute::C<32768>, cute::C<32768>>, ....

          detected during:
            instantiation of "void cute::cooperative_copy<... SrcLayout=cute::Layout<cute::tuple<cute::C<65536>, cute::C<65536>>, ....

          detected during:
            instantiation of "void cute::cooperative_copy<... SrcLayout=cute::Layout<cute::tuple<cute::C<larger values here>, .......
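The arithmetic behind warning #61-D can be checked with a small C++ sketch. `total_bits` and `fits_in_int` are hypothetical helpers, a 32-bit `int` is assumed, and the 32-bit element width is only an assumption since the log above does not show the element type. For a 16384 x 16384 layout, even 8-bit elements give 2^31 bits, which is one more than the largest 32-bit `int`.

```cpp
#include <cstdint>
#include <limits>

// Total bits of a rows x cols layout, computed in 64-bit so the product
// itself cannot overflow for the sizes discussed here.
constexpr std::int64_t total_bits(std::int64_t rows, std::int64_t cols,
                                  std::int64_t elem_bits) {
    return rows * cols * elem_bits;
}

// True when the product still fits the `int` used by
// `constexpr int total_bits = total_elem * elem_bits;` in the warning.
constexpr bool fits_in_int(std::int64_t rows, std::int64_t cols,
                           std::int64_t elem_bits) {
    return total_bits(rows, cols, elem_bits) <= std::numeric_limits<int>::max();
}

// A 512 x 512 tile of 32-bit elements is 2^23 bits: fine.
static_assert(fits_in_int(512, 512, 32), "512x512 fits");
// A 16384 x 16384 layout overflows for 32-bit and even for 8-bit elements.
static_assert(!fits_in_int(16384, 16384, 32), "16384x16384x32 overflows int");
static_assert(!fits_in_int(16384, 16384, 8), "16384x16384x8 overflows int");
```

So the warnings start exactly where a whole-matrix layout stops being representable, which suggests these layouts were never meant to describe full matrices.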

But in the compile-time dispatch, I’m only changing the lda, ldb, ldc values between instantiations (there are only 9 instantiations in the kernel, with lda, ldb values like 1024 and 2048 only). I wouldn’t expect it to instantiate matrix layout sizes that I’m not using.


There’s an execute method that accepts only the shared-memory tile’s leading dimensions (see “Execution Methods” in the cuBLASDx documentation), but I need a version that takes the global lda, ldb, ldc, used only for copying the inputs and outputs.


The example codes run fine. I only took one of them and generated 9 GEMM / BLAS definitions from it inside the kernel, but the compile time grows by far more than 9x, and RAM usage above 50 GB suggests this is not the intended way to use cuBLASDx for variable global lda/ldb/ldc in a batched GEMM.

I probably used cuBLASDx wrong, e.g. by computing a full-matrix GEMM in a block with a single execute call. (Different matrix sizes and different types exposed as extra template parameters mean multiple instantiations, multiplied by the number of compile-time dispatch cases for lda, ldb.)

I’ll try using a fixed tile size (instead of a full GEMM) that divides all possible matrix sizes, remove all of the template parameters, and convert them to runtime parameters with extra loops that do the tiled computation.
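Roughly, the plan looks like the following host-side C++ sketch. It is an illustration under assumptions, not cuBLASDx code: `TILE`, `idx`, `load_tile`, `store_tile`, `tile_mma` and `tiled_gemm` are hypothetical names, matrices are square and indexed as x + y * pitch, and `tile_mma` is a plain loop standing in for the per-tile multiply-accumulate that the library's execute call would do on shared-memory tiles. Only the tile size is a compile-time constant; size and lda, ldb, ldc are runtime values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 16;  // hypothetical fixed tile edge; must divide every size

inline std::size_t idx(int x, int y, int ld) {
    return static_cast<std::size_t>(x) +
           static_cast<std::size_t>(y) * static_cast<std::size_t>(ld);
}

// Manual copy from pitched global memory into a dense TILE x TILE buffer
// (in a kernel this buffer would be shared memory).
inline void load_tile(const float* g, int ld, int x0, int y0, float* tile) {
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x)
            tile[idx(x, y, TILE)] = g[idx(x0 + x, y0 + y, ld)];
}

inline void store_tile(const float* tile, float* g, int ld, int x0, int y0) {
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x)
            g[idx(x0 + x, y0 + y, ld)] = tile[idx(x, y, TILE)];
}

// Stand-in for the tile-level multiply-accumulate: acc += a * b.
inline void tile_mma(const float* a, const float* b, float* acc) {
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x) {
            float s = acc[idx(x, y, TILE)];
            for (int k = 0; k < TILE; ++k)
                s += a[idx(k, y, TILE)] * b[idx(x, k, TILE)];
            acc[idx(x, y, TILE)] = s;
        }
}

// C = A * B with runtime size and runtime pitches.
inline void tiled_gemm(const float* A, const float* B, float* C,
                       int size, int lda, int ldb, int ldc) {
    assert(size % TILE == 0 && lda >= size && ldb >= size && ldc >= size);
    std::vector<float> ta(TILE * TILE), tb(TILE * TILE), acc(TILE * TILE);
    for (int y0 = 0; y0 < size; y0 += TILE)
        for (int x0 = 0; x0 < size; x0 += TILE) {
            std::fill(acc.begin(), acc.end(), 0.0f);
            for (int k0 = 0; k0 < size; k0 += TILE) {  // loop over K tiles
                load_tile(A, lda, k0, y0, ta.data());
                load_tile(B, ldb, x0, k0, tb.data());
                tile_mma(ta.data(), tb.data(), acc.data());
            }
            store_tile(acc.data(), C, ldc, x0, y0);
        }
}
```

With this structure there is a single instantiation per tile shape and data type, and the number of GEMM sizes or pitch combinations no longer affects compile time.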

Result: cuBLASDx is meant for tile-level computation only. I was trying to compile 16, 32, …, 64k sized matrices as a single GEMM operation without any tiling (the “Supported Memory Spaces” part of the documentation says it is for shared-memory-sized tiles). Many examples have only one call to the execute method, which I misinterpreted as a full GEMM operation.