Why is there no performance improvement from asynchronous copy on A100 with the CUDA Samples?

I am very interested in the potential of the new asynchronous copy feature on the A100. I used the official GEMM example to compare two kernels, one with and one without asynchronous copy. The experimental results are as follows. I am using a 40 GB A100 with CUDA 12.1.

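I believe that asynchronous copy can save registers, since the data moves from global to shared memory without being staged through registers, and can overlap computation with data movement. However, I see no improvement in my measurements. Could you please give me some suggestions about what might explain this?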
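To make the comparison concrete, here is a minimal sketch of the two copy styles in question. It is not the code from the CUDA sample: the kernel names, the `TILE` size, and the `kernels_agree` helper are made up for illustration, and the "computation" is just a scaling by 2. It uses `cooperative_groups::memcpy_async`, which should fall back to an ordinary copy on GPUs without hardware support. Note that in this simple form the block waits right after issuing the copy, so there is little to overlap; any gain would presumably need independent work between the copy and the wait, or a multi-stage pipeline.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements per block (illustrative choice)

// Synchronous staging: each element goes global -> register -> shared.
__global__ void scale_sync(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Asynchronous staging: the block issues one collective copy into shared
// memory, then waits for it before reading the tile.
__global__ void scale_async(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();
    int base = blockIdx.x * TILE;
    int count = min(TILE, n - base);  // last block may be partial
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    // Work that does not depend on `tile` could be placed here to overlap.
    cg::wait(block);  // copy is complete and visible to the whole block
    int i = base + threadIdx.x;
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Hypothetical host helper: runs both kernels and checks that they agree.
bool kernels_agree(int n) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 97);

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int grid = (n + TILE - 1) / TILE;  // block size must equal TILE
    scale_sync<<<grid, TILE>>>(d_in, d_a, n);
    scale_async<<<grid, TILE>>>(d_in, d_b, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    for (int i = 0; i < n && ok; ++i)
        ok = (h_a[i] == h_b[i]) && (h_a[i] == 2.0f * h_in[i]);
    return ok;
}
```

Calling `kernels_agree(1000)` from a `main` and compiling with `nvcc` is enough to check that both variants compute the same result; timing them would need `cudaEvent` measurements around each launch.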