Hello, I’m running Megatron-LM on an ARM-based machine with four A100 cards, but the performance is worse than on an x86-based machine. After collecting performance data with NVIDIA Nsight Systems (nsys), I found that cuBLAS GEMM kernels are used on ARM while CUTLASS GEMM kernels are used on x86. My PyTorch version is 1.11, and I don’t know why this weird thing happens. Is there some incompatibility among cuBLAS, cuBLASLt, and CUTLASS on my ARM machine? Can anybody help?
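For anyone who wants to compare the two setups, here is a minimal sketch of a script that could be run on both the ARM and the x86 machine to print the PyTorch build details side by side. The helper name `collect_build_info` is hypothetical, and the script only reports versions; it does not by itself explain which GEMM backend gets selected.

```python
import platform


def collect_build_info():
    """Gather basic platform and PyTorch build details for comparison."""
    info = {
        "arch": platform.machine(),            # CPU architecture string reported by the OS
        "python": platform.python_version(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda               # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["gpu"] = (
        torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
    )
    return info


if __name__ == "__main__":
    for key, value in collect_build_info().items():
        print(f"{key}: {value}")
    try:
        import torch
        # Full build configuration string (compiler, BLAS, CUDA flags, ...).
        print(torch.__config__.show())
    except ImportError:
        pass
```

Diffing the output from the two machines should show whether the PyTorch builds differ in CUDA version or build flags, which may be a useful starting point.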
Best Wishes.