The performance results of the CUDA sample globalToShmemAsyncCopy are puzzling

I have tested this sample on a 2070S and a 3090, and on both devices the Naive kernel is the fastest. Does this sample only demonstrate how to use the API, rather than its performance benefit?
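For reference, the API the sample exercises is the asynchronous global-to-shared copy (cooperative_groups::memcpy_async, which compiles to cp.async on compute capability 8.0+ and falls back to a staged copy on older GPUs). Below is a minimal sketch of that copy pattern, not the sample's actual kernel; TILE and tile_load_async are assumed names used only for illustration.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: copy one TILE x TILE tile from global to shared memory
// asynchronously, then wait before computing on it.
template <int TILE>
__global__ void tile_load_async(const float* __restrict__ gmem, int ld)
{
    __shared__ float smem[TILE][TILE];
    cg::thread_block tb = cg::this_thread_block();

    // Top-left element of the tile this block is responsible for
    // (assumes a 2D grid with one block per tile).
    const float* src = gmem + blockIdx.y * TILE * ld + blockIdx.x * TILE;

    for (int row = 0; row < TILE; ++row) {
        // Collective call: on SM 8.x this lowers to cp.async and bypasses
        // registers; on older architectures it degrades to a normal copy.
        cg::memcpy_async(tb, &smem[row][0], src + row * ld, sizeof(float) * TILE);
    }
    cg::wait(tb);  // block until all outstanding async copies have completed

    // ... compute on smem ...
}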

Here is the log from the 3090:

----------------------------- SM 86 -----------------------------
Running kernel = 0 - AsyncCopyMultiStageLargeChunk
Computing result using CUDA Kernel…
done
Performance= 2715.96 GFlop/s, Time= 50.604 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 1 - AsyncCopyLargeChunk
Computing result using CUDA Kernel…
done
Performance= 2721.41 GFlop/s, Time= 50.503 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 2 - AsyncCopyLargeChunkAWBarrier
Computing result using CUDA Kernel…
done
Performance= 1641.61 GFlop/s, Time= 83.722 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 3 - AsyncCopyMultiStageSharedState
Computing result using CUDA Kernel…
done
Performance= 1815.21 GFlop/s, Time= 75.715 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 4 - AsyncCopyMultiStage
Computing result using CUDA Kernel…
done
Performance= 2678.29 GFlop/s, Time= 51.316 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 5 - AsyncCopySingleStage
Computing result using CUDA Kernel…
done
Performance= 2688.78 GFlop/s, Time= 51.116 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 6 - Naive
Computing result using CUDA Kernel…
done
Performance= 3129.04 GFlop/s, Time= 43.924 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 7 - NaiveLargeChunk
Computing result using CUDA Kernel…
done
Performance= 3050.63 GFlop/s, Time= 45.053 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.