I am blocking GEMMs on the M dimension (for tall-and-skinny A/C). Assume a cublasDx GEMM like this:
using GEMM = decltype(cublasdx::Size<BLOCK_M, N, K>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::Arrangement<cublasdx::col_major, cublasdx::row_major, cublasdx::row_major>()
                      [...]
                      + cublasdx::LeadingDimension<M, N, N>());
Querying the shared memory tensor for A takes the leading dimension of A into account, but querying the global memory tensor ignores it. Why? What is the reasoning behind that asymmetry?
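For context, here is roughly what the kernel-side code looks like (a trimmed sketch, only A is shown; the pointer name, the alignment, and the commented-out copy call are placeholders, and I'm using the tensor-API helpers make_tensor / get_layout_smem_a / get_layout_gmem_a as I understand them from the docs):

__global__ void gemm_kernel(const GEMM::a_value_type* a_gmem_ptr /*, b, c, ... */) {
    extern __shared__ __align__(16) char smem[];

    // Global-memory tensor for this block's tile of A (per-block row offset elided):
    // the default layout comes from get_layout_gmem_a(), which derives its LD from
    // Size<>, not from LeadingDimension<> (see the library code below).
    auto a_global = cublasdx::make_tensor(a_gmem_ptr, GEMM::get_layout_gmem_a());

    // Shared-memory tensor for A: get_layout_smem_a() does honor LeadingDimension<>.
    auto a_shared = cublasdx::make_tensor(reinterpret_cast<GEMM::a_value_type*>(smem),
                                          GEMM::get_layout_smem_a());

    // cublasdx::copy<GEMM, /*alignment*/>(a_global, a_shared);
    // ... B/C handling and GEMM::execute(...) elided ...
}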
The code from cublasDx in question is this:
template<matrix M, class MemTag>
static constexpr int get_default_ld() {
    static_assert(cute::is_same_v<MemTag, smem_tag> or cute::is_same_v<MemTag, gmem_tag>);
    int ret = 0;
    if constexpr(cute::is_same_v<MemTag, smem_tag>) {
        ret = choose<M>(base_type::this_blas_lda, base_type::this_blas_ldb, base_type::this_blas_ldc);
    } else {
        constexpr arrangement arr = choose<M>(base_type::this_blas_arrangement_a,
                                              base_type::this_blas_arrangement_b,
                                              base_type::this_blas_arrangement_c);
        ret = (arr == col_major)
                  ? choose<M>(this_blas_size::m, this_blas_size::k, this_blas_size::m)
                  : choose<M>(this_blas_size::k, this_blas_size::n, this_blas_size::n);
    }
    return ret;
}
Concretely, with the GEMM above the shared memory layout for A gets an LD of M (from LeadingDimension<M, N, N>), while the default global memory layout gets an LD of BLOCK_M (from Size<BLOCK_M, N, K>). That seems backwards: the leading dimension is of little use for the shared memory tile when copying from global to shared memory, but it is essential for addressing the global memory tensor. The docs are not at all clear about this. I understand the desire to let users pass custom LDs to get_layout_gmem_*(), but I don't understand why cublasDx deliberately ignores the LDs I have already specified for this exact GEMM, and on top of that is inconsistent about it between the global and shared memory layouts.
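In other words, to get a global layout that actually matches my tall-and-skinny A, I apparently have to repeat the LD by hand, along the lines of (a sketch, assuming the get_layout_gmem_a() overload that accepts a leading dimension):

// Repeat lda = M by hand, even though LeadingDimension<M, N, N> already carries it.
auto a_global = cublasdx::make_tensor(a_gmem_ptr, GEMM::get_layout_gmem_a(M));
// Without the argument, get_layout_gmem_a() silently assumes an LD of BLOCK_M
// (the m from Size<BLOCK_M, N, K>), which is only right if A is a single un-strided tile.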