CUTLASS not working on ARM-based machine

834352945 · April 11, 2023, 12:07pm

Hello, I’m running Megatron-LM on my ARM-based machine with four A100 cards, but the performance is worse than on an x86-based machine. After collecting performance data with NVIDIA Nsight Systems (nsys), I found that cuBLAS GEMM kernels are used on ARM, while CUTLASS GEMM kernels are used on x86. My PyTorch version is 1.11, and I don’t know why this weird thing is happening. Is there some incompatibility among cuBLAS, cuBLASLt, and CUTLASS on my ARM machine? Can anybody help?

Best Wishes.

834352945 · April 12, 2023, 3:34am

After setting CUBLASLT_LOG_LEVEL=5, I found that every heuristicResult is set to [21] on the x86-based machine, while half of the heuristicResult values are set to [0] and [1] on the ARM-based machine. Can anybody help?
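
To compare the two machines more systematically, the heuristicResult values in each log could be tallied with a small script. The following is only a rough sketch: the regular expression assumes each log line contains the token heuristicResult followed on the same line by a bracketed integer, and the sample strings are hypothetical placeholders, not real cuBLASLt output. Adjust the pattern to whatever your logs actually look like.

```python
import re
from collections import Counter

# Assumed pattern: "heuristicResult" followed on the same line by a bracketed
# integer such as "[21]". This is a guess at the log layout; adapt as needed.
HEURISTIC_RE = re.compile(r"heuristicResult[^\d\n]*\[(\d+)\]")


def count_heuristic_results(log_text):
    """Return a Counter mapping each heuristicResult index to its frequency."""
    return Counter(int(m.group(1)) for m in HEURISTIC_RE.finditer(log_text))


# Hypothetical log excerpts, for illustration only.
x86_log = "... heuristicResult=[21]\n... heuristicResult=[21]\n"
arm_log = (
    "... heuristicResult=[0]\n"
    "... heuristicResult=[21]\n"
    "... heuristicResult=[1]\n"
)

print("x86:", count_heuristic_results(x86_log))
print("ARM:", count_heuristic_results(arm_log))
```

With real logs, the text could be read from the file that the cuBLASLt logger writes (for example via the CUBLASLT_LOG_FILE environment variable) and passed to count_heuristic_results. Comparing the two tallies side by side shows which indices appear only on one platform.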
