I have successfully built apex from source on A100 hardware, using CUDA 11.4 and Anaconda. However, the fmha test takes more than 20 seconds to finish. The output is:
(......) ...@...:~/apex/apex/contrib/test/fmha$ python test_fmha.py
Test s=128 b=32
.Test s=256 b=32
.Test s=384 b=32
.Test s=512 b=32
Test s=512 b=2
Test s=512 b=3
.
----------------------------------------------------------------------
Ran 4 tests in 23.213s
OK
What could be causing the test to run this slowly? Thanks in advance!
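To help narrow this down, here is a minimal sketch of how the first call could be timed separately from later calls, to check whether the time is a one-off startup cost or is spent on every call. The helper name `time_first_vs_later` and its `sync` parameter are hypothetical and not part of apex. On a GPU, I assume `torch.cuda.synchronize` would be passed as `sync` so that queued kernels are included in the measurement.

```python
import time


def time_first_vs_later(fn, repeats=5, sync=None):
    """Return (first_call_seconds, mean_later_seconds) for fn().

    fn      -- zero-argument callable to measure (hypothetical stand-in
               for one forward/backward pass of the test)
    repeats -- number of calls to average after the first one
    sync    -- optional zero-argument callable run after each fn() call,
               e.g. torch.cuda.synchronize (assumption: GPU work is
               asynchronous and must be waited on before stopping the clock)
    """
    def timed_call():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    first = timed_call()                      # includes any one-time setup
    later = [timed_call() for _ in range(repeats)]
    return first, sum(later) / len(later)
```

If the first call dominates, the time is probably going into initialization rather than the fmha kernels themselves. If the later calls are also slow, the kernels or the test sizes would be the next thing to look at.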