Support for per-multiplication m, n, k, lda, ldb, and ldc in batched gemm

tugrul_192bit January 3, 2026, 1:18pm

cublasHgemmBatched takes a single set of m, n, k and a single set of lda, ldb, ldc that apply to every multiplication in the batch. In some cases, though, these parameters differ across hundreds of matrices. Are there any plans for cuBLAS to support per-multiplication parameters?

For example, with 5000 matrices and 90 unique (m, n, k, lda, ldb, ldc) parameter sets, is it mandatory to use CUDA streams to run multiple batched GEMMs in parallel?
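To make the scenario concrete, here is a minimal host-side sketch of the grouping step, assuming each multiplication is described by a small struct. `GemmProblem`, `ShapeKey`, and `group_by_shape` are hypothetical names used only for illustration; they are not part of cuBLAS. The idea is that every bucket could then be passed to one cublasHgemmBatched call, possibly on its own stream set with cublasSetStream.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical descriptor for one multiplication (not a cuBLAS type).
struct GemmProblem {
    int m, n, k;        // problem dimensions
    int lda, ldb, ldc;  // leading dimensions of A, B, C
};

// The six parameters that a single cublasHgemmBatched call shares
// across every multiplication in its batch.
using ShapeKey = std::tuple<int, int, int, int, int, int>;

// For each unique parameter set, collect the indices of the problems
// that use it. With 5000 problems and 90 unique sets, this yields
// 90 buckets.
std::map<ShapeKey, std::vector<std::size_t>>
group_by_shape(const std::vector<GemmProblem>& problems) {
    std::map<ShapeKey, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < problems.size(); ++i) {
        const GemmProblem& p = problems[i];
        groups[ShapeKey{p.m, p.n, p.k, p.lda, p.ldb, p.ldc}].push_back(i);
    }
    return groups;
}

// Assumed next step (not shown, needs the CUDA toolkit): for each bucket,
// gather the A/B/C device pointers of its indices into pointer arrays,
// optionally call cublasSetStream to pick a stream for that bucket, and
// issue one cublasHgemmBatched with batchCount = bucket size.
```

Whether the per-bucket launches actually overlap on the GPU depends on problem sizes and occupancy, so this only shows the bookkeeping, not a performance claim.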
