Dear Community,
I have been running the 11_gemm_device_performance/device_gemm_performance example of cuBLASDx, but it consistently reaches only 20-30% of the cuBLASLt performance. Shouldn't we see similar performance from both?
$ ./11_gemm_device_performance/device_gemm_performance
m, n, k: 8192, 8192, 8192
Compute Type A: real
Compute Type B: real
Compute Type C: real
Dx Input Precision A: half
Dx Input Precision B: half
Dx Input Precision C: float
cuBLASDx
Avg time [ms] = 6.2206
Avg GFLOP/s = 176753.2610
cuBLASLt (not including heuristic)
Avg time [ms] = 1.4454
Avg GFLOP/s = 760689.5989
Vector reference norm: [8.16006e+05]
Vector result norm: [8.16006e+05]
Vector relative error: [3.85102e-09]
Average relative error: [1.80739e-10]
Maximum relative error: [6.75911e-05]
Maximum absolute error: [3.05176e-05]
Error = 0.0000000039
cuBLAS / cuBLASDx timings = 0.23
I ran the experiments on an H100 GPU.
Hi,
this sample is very sensitive to the chosen tile size and block size, as well as to the operators used.
Does your sample use "StaticBlockDim" in the description creation? IIRC this was omitted in 0.5.0 and added in 0.5.1.
H100 prefers 256 threads with large tiles. If you provide me with the specific global sizes, precisions, and arrangements, I should be able to help you reach the required performance.
On H100 the expected performance is ~90% of cuBLASLt or better.
Hi, thanks for your reply.
I use the official nvidia-mathdx-25.12.1-cuda12 package pulled from the official download page, so based on the HTML documentation it is probably version 0.5.1, and it does use StaticBlockDim:
// cuBLASDx type creation
using BLAS = decltype(cublasdx::Size<tile_m, tile_n, tile_k>() +
                      cublasdx::Precision<a_compute_precision, b_compute_precision, c_compute_precision>() +
                      cublasdx::Type<type>() + cublasdx::Function<cublasdx::function::MM>() +
                      cublasdx::Arrangement<tile_arr_a, tile_arr_b, tile_arr_c>() + cublasdx::Block() +
                      cublasdx::BlockDim<tile_threads>() + cublasdx::StaticBlockDim() +
                      cublasdx::Alignment<cublasdx_alignment, cublasdx_alignment, cublasdx_alignment>() +
                      cublasdx::EnableInputStreaming() + cublasdx::WithPipeline() + cublasdx::SM<Arch, Modifier>());
The arrangement in the example is:
constexpr auto global_arrangement_a = cublasdx::row_major;
constexpr auto global_arrangement_b = cublasdx::col_major;
constexpr auto global_arrangement_c = cublasdx::row_major;
I have not modified the code at all; I was just surprised that the performance difference is so big.
The example actually now reaches 47% with m=n=k=8192 when I set CUBLASDX_CUDA_ARCHITECTURES to 90a-real and CMAKE_CUDA_ARCHITECTURES to 90a, instead of the more general 90 that I wrongly used in the original experiment. (I guess the difference is that the generic 90 build does not set the cublasdx::arch_specific SM operator modifier.)
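For concreteness, the reconfiguration described above could look like the following (a sketch: the out-of-tree `build` directory is an assumption, and the Release build type is included because the documentation recommends it for benchmarking):

```shell
# Assumed out-of-tree build; 90a enables Hopper arch-specific codegen
cmake -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_ARCHITECTURES=90a \
      -DCUBLASDX_CUDA_ARCHITECTURES=90a-real
cmake --build build
```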
I want a good baseline first; my end goal is an optimized batched (possibly also strided) GEMM on hundreds to thousands of small matrices (m=n=k=128 to 512), ideally outperforming plain old cuBLAS.
Edit: I can reach 85% by optimizing only the tile sizes and the number of threads.
Edit #2: setting the build type to Release, as suggested in the documentation, gets me to 91%.
Hello,
I have experimented with it locally, if you set:
constexpr unsigned tile_m = 128;
constexpr unsigned tile_n = 256;
constexpr unsigned tile_k = 32;
with
constexpr int tile_threads = 256;
and
constexpr unsigned manual_pipeline_depth = 5;
I get ~91% of SOL performance on an H200 with clocks locked for stability and reproducibility. Please let me know if you can recreate these results; I'm happy to assist further if necessary.
Another config that worked for me is a 128x256x64 tile with 256 threads and pipeline depth 3.