I am blocking GEMMs on the M dimension (for tall-and-skinny A/C). Assume a cublasDx GEMM like this:
using GEMM = decltype(cublasdx::Size<BLOCK_M, N, K>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::Arrangement<cublasdx::col_major, cublasdx::row_major, cublasdx::row_major>()
                      [...]
                      + cublasdx::LeadingDimension<M, N, N>());
Querying the shared memory tensor for A takes the leading dimension of A into account, but querying the global memory tensor ignores it. Why? What is the reasoning behind that asymmetry?
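For context, here is roughly what the kernel-side code looks like (a trimmed sketch, only A is shown; the pointer name, the alignment, and the commented-out copy call are placeholders, and I'm using the tensor-API helpers make_tensor / get_layout_smem_a / get_layout_gmem_a as I understand them from the docs):

__global__ void gemm_kernel(const GEMM::a_value_type* a_gmem_ptr /*, b, c, ... */) {
    extern __shared__ __align__(16) char smem[];

    // Global-memory tensor for this block's tile of A (per-block row offset elided):
    // the default layout comes from get_layout_gmem_a(), which derives its LD from
    // Size<>, not from LeadingDimension<> (see the library code below).
    auto a_global = cublasdx::make_tensor(a_gmem_ptr, GEMM::get_layout_gmem_a());

    // Shared-memory tensor for A: get_layout_smem_a() does honor LeadingDimension<>.
    auto a_shared = cublasdx::make_tensor(reinterpret_cast<GEMM::a_value_type*>(smem),
                                          GEMM::get_layout_smem_a());

    // cublasdx::copy<GEMM, /*alignment*/>(a_global, a_shared);
    // ... B/C handling and GEMM::execute(...) elided ...
}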
The code from cublasDx in question is this:
template<matrix M, class MemTag>
static constexpr int get_default_ld() {
    static_assert(cute::is_same_v<MemTag, smem_tag> or cute::is_same_v<MemTag, gmem_tag>);
    int ret = 0;
    if constexpr(cute::is_same_v<MemTag, smem_tag>) {
        ret = choose<M>(base_type::this_blas_lda, base_type::this_blas_ldb, base_type::this_blas_ldc);
    } else {
        constexpr arrangement arr = choose<M>(base_type::this_blas_arrangement_a,
                                              base_type::this_blas_arrangement_b,
                                              base_type::this_blas_arrangement_c);
        ret = (arr == col_major)
                  ? choose<M>(this_blas_size::m, this_blas_size::k, this_blas_size::m)
                  : choose<M>(this_blas_size::k, this_blas_size::n, this_blas_size::n);
    }
    return ret;
}
Concretely, with the GEMM above the shared memory layout for A gets an LD of M (from LeadingDimension<M, N, N>), while the default global memory layout gets an LD of BLOCK_M (from Size<BLOCK_M, N, K>). That seems backwards: the leading dimension is of little use for the shared memory tile when copying from global to shared memory, but it is essential for addressing the global memory tensor. The docs are not at all clear about this. I understand the desire to let users pass custom LDs to get_layout_gmem_*(), but I don't understand why cublasDx deliberately ignores the LDs I have already specified for this exact GEMM, and on top of that is inconsistent about it between the global and shared memory layouts.
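In other words, to get a global layout that actually matches my tall-and-skinny A, I apparently have to repeat the LD by hand, along the lines of (a sketch, assuming the get_layout_gmem_a() overload that accepts a leading dimension):

// Repeat lda = M by hand, even though LeadingDimension<M, N, N> already carries it.
auto a_global = cublasdx::make_tensor(a_gmem_ptr, GEMM::get_layout_gmem_a(M));
// Without the argument, get_layout_gmem_a() silently assumes an LD of BLOCK_M
// (the m from Size<BLOCK_M, N, K>), which is only right if A is a single un-strided tile.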