FMHA module of APEX runs very slowly in some cases

I successfully built apex from source on an A100 using CUDA 11.4 and Anaconda. However, running the fmha test took more than 20 seconds to finish. The output is:

(......) ...@...:~/apex/apex/contrib/test/fmha$ python
Test s=128 b=32
.Test s=256 b=32
.Test s=384 b=32
.Test s=512 b=32
Test s=512 b=2
Test s=512 b=3
Ran 4 tests in 23.213s


What could be the cause? Thanks in advance!

Is this question difficult to answer? It seems to me that it only relates to incorrectly configured compile or runtime parameters.
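One way to narrow this down might be to check whether the time goes into one-time startup work on the first call or into every call. Below is a minimal sketch of such a measurement, not a confirmed diagnosis. `time_first_vs_rest` and `run_once` are hypothetical names that do not come from apex; `run_once` stands for any callable that wraps a single fmha forward call. On a GPU, `sync` would need to be `torch.cuda.synchronize`, since CUDA kernels launch asynchronously.

```python
import time


def time_first_vs_rest(run_once, sync=lambda: None, repeats=10):
    """Time the first call separately from the mean of later calls.

    run_once: hypothetical zero-argument callable wrapping the workload.
    sync:     callable that waits for pending device work; on a GPU this
              is assumed to be torch.cuda.synchronize.
    Returns (first_call_seconds, mean_later_call_seconds).
    """
    # First call: includes any one-time initialization cost.
    sync()
    start = time.perf_counter()
    run_once()
    sync()
    first = time.perf_counter() - start

    # Later calls: steady-state cost, averaged over `repeats` runs.
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    sync()
    rest = (time.perf_counter() - start) / repeats

    return first, rest


# Usage sketch (assumes PyTorch is installed and `forward` is defined elsewhere):
#   import torch
#   first, rest = time_first_vs_rest(forward, sync=torch.cuda.synchronize)
#   print(f"first call: {first:.3f}s, later calls: {rest:.6f}s each")
```

If the first call dominates, the slowdown is more likely in setup than in the kernel itself. If every call is slow, the build or runtime configuration is the more likely suspect.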