FMHA module of APEX runs very slowly in some cases

I successfully built apex from source on A100 hardware, using CUDA 11.4 and Anaconda. However, the fmha test takes more than 20 seconds to finish. This is the result:

(......) ...@...:~/apex/apex/contrib/test/fmha$ python test_fmha.py
Test s=128 b=32
.Test s=256 b=32
.Test s=384 b=32
.Test s=512 b=32
Test s=512 b=2
Test s=512 b=3
.
----------------------------------------------------------------------
Ran 4 tests in 23.213s

OK

What could be the cause? Thanks in advance!

Is this question difficult to answer? It seems likely to be related only to incorrectly configured build or runtime parameters.