CUTLASS not working on ARM-based machine

Hello, I’m running Megatron-LM on my ARM-based machine with four A100 cards, but the performance is not as good as on an x86-based machine. After collecting performance data with NVIDIA Nsight Systems (nsys), I found that cuBLAS GEMM kernels are used on ARM while CUTLASS GEMM kernels are used on x86. My PyTorch version is 1.11, and I don’t know why this weird thing is happening. Is there some incompatibility among cuBLAS, cuBLASLt, and CUTLASS on my ARM machine? Can anybody help?

Best wishes.

Update: after setting CUBLASLT_LOG_LEVEL=5, I found that every heuristicResult is set to [21] on the x86-based machine, while half of the heuristicResult values are set to [0] or [1] on the ARM-based machine. Can anybody help?
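To make the comparison between the two machines easier to reproduce, here is a minimal sketch of how the heuristicResult values could be counted from a saved cuBLASLt log. It assumes the log was written to a file (for example by also setting CUBLASLT_LOG_FILE) and that each relevant line contains the text "heuristicResult" followed on the same line by a bracketed integer such as [21]. That line layout is my assumption based on what I saw, not a documented format, and the function names are only illustrative.

```python
import re
from collections import Counter

# Assumption: a relevant log line mentions "heuristicResult" and, later on the
# same line, a bracketed integer such as [21]. Adjust the pattern if the
# actual log layout differs.
PATTERN = re.compile(r"heuristicResult[^\[\n]*\[(\d+)\]")


def tally_heuristic_results(lines):
    """Count how often each bracketed heuristicResult value appears."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


def tally_file(path):
    """Tally heuristicResult values from a log file (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return tally_heuristic_results(handle)
```

Calling tally_file on the log from each machine and printing the two counters side by side should show whether the ARM run really splits between [0] and [1] while the x86 run stays at [21].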