Dear Community,
I have been running the 11_gemm_device_performance/device_gemm_performance example of cuBLASDx, but it consistently reaches only 20-30% of the cuBLASLt performance. Shouldn't we see similar performance from both?
$ ./11_gemm_device_performance/device_gemm_performance
m, n, k: 8192, 8192, 8192
Compute Type A: real
Compute Type B: real
Compute Type C: real
Dx Input Precision A: half
Dx Input Precision B: half
Dx Input Precision C: float
cuBLASDx
Avg time [ms] = 6.2206
Avg GFLOP/s = 176753.2610
cuBLASLt (not including heuristic)
Avg time [ms] = 1.4454
Avg GFLOP/s = 760689.5989
Vector reference norm: [8.16006e+05]
Vector result norm: [8.16006e+05]
Vector relative error: [3.85102e-09]
Average relative error: [1.80739e-10]
Maximum relative error: [6.75911e-05]
Maximum absolute error: [3.05176e-05]
Error = 0.0000000039
cuBLAS / cuBLASDx timings = 0.23
I ran the experiments on an H100 GPU.
Hi,
this sample is very sensitive to the chosen tile size and block size, as well as to the operators used.
Does your sample use "StaticBlockDim" in the description creation? IIRC this was omitted in 0.5.0 and added in 0.5.1.
H100 prefers 256 threads with large tiles. If you provide me with the specific global sizes, precisions, and arrangements, I should be able to help you reach the required performance.
On H100 the expected performance is ~90% of cuBLASLt or better.
Hi, thanks for your reply.
I use the official nvidia-mathdx-25.12.1-cuda12 package pulled from the official download page, so based on the HTML documentation it is probably version 0.5.1, and it does use StaticBlockDim:
// cuBLASDx type creation
using BLAS = decltype(cublasdx::Size<tile_m, tile_n, tile_k>() +
                      cublasdx::Precision<a_compute_precision, b_compute_precision, c_compute_precision>() +
                      cublasdx::Type<type>() + cublasdx::Function<cublasdx::function::MM>() +
                      cublasdx::Arrangement<tile_arr_a, tile_arr_b, tile_arr_c>() + cublasdx::Block() +
                      cublasdx::BlockDim<tile_threads>() + cublasdx::StaticBlockDim() +
                      cublasdx::Alignment<cublasdx_alignment, cublasdx_alignment, cublasdx_alignment>() +
                      cublasdx::EnableInputStreaming() + cublasdx::WithPipeline() + cublasdx::SM<Arch, Modifier>());
The arrangement in the example is:
constexpr auto global_arrangement_a = cublasdx::row_major;
constexpr auto global_arrangement_b = cublasdx::col_major;
constexpr auto global_arrangement_c = cublasdx::row_major;
I have not modified the code at all; I was just surprised that the performance difference is so big.
The example actually now reaches 47% with m=n=k=8192 when I set CUBLASDX_CUDA_ARCHITECTURES to 90a-real and CMAKE_CUDA_ARCHITECTURES to 90a, instead of the more general 90 that I wrongly used in the original experiment. (I guess the difference is that the generic 90 build does not set the cublasdx::arch_specific SM operator modifier.)
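For concreteness, the reconfiguration described above could look like the following (a sketch: the out-of-tree `build` directory is an assumption, and the Release build type is included because the documentation recommends it for benchmarking):

```shell
# Assumed out-of-tree build; 90a enables Hopper arch-specific codegen
cmake -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CUDA_ARCHITECTURES=90a \
      -DCUBLASDX_CUDA_ARCHITECTURES=90a-real
cmake --build build
```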
I want a good baseline first; my end goal is an optimized batched (possibly also strided) GEMM on hundreds to thousands of small matrices (m=n=k=128 to 512), ideally outperforming plain old cuBLAS.
Edit: I can reach 85% by optimizing only the tile sizes and the number of threads.
Edit #2: setting the build type to Release, as suggested in the documentation, gets me to 91%.
Hello,
I have experimented with it locally, if you set:
constexpr unsigned tile_m = 128;
constexpr unsigned tile_n = 256;
constexpr unsigned tile_k = 32;
with
constexpr int tile_threads = 256;
and
constexpr unsigned manual_pipeline_depth = 5;
I get ~91% of SOL performance on an H200 with clocks locked for stability and reproducibility. Please let me know if you can recreate these results; I'm happy to assist further if necessary.
Another config that worked for me is a 128x256x64 tile with 256 threads and pipeline depth 3.