Shared memory atomicAdd is slower than global memory atomicAdd

Hi,

I am testing the shared memory atomicAdd version from the blog post "CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics".

I found that the version using shared memory atomicAdd was 15% to 25% slower than the one using global memory atomicAdd. The entire test code was copied from the blog post above. What could be the reason?
My test platform is a Jetson Xavier with Ubuntu 16.04 and CUDA 10.2.
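
For context, here is a minimal sketch of the kind of comparison described above. It is not the exact blog code: the kernel names, the block size `BS`, the helper functions `make_input` and `time_kernel`, and the alternating test input are all placeholders chosen for illustration. It times one launch of each variant with CUDA events after an untimed warm-up launch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BS = 256;  // threads per block (assumed value)

// Variant 1: every thread that passes the predicate does one atomicAdd
// directly on the global counter.
__global__ void filter_global(int *dst, int *nres, const int *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0)
        dst[atomicAdd(nres, 1)] = src[i];
}

// Variant 2: threads count in a shared memory counter, then one thread
// per block reserves a range in the global counter.
__global__ void filter_shared(int *dst, int *nres, const int *src, int n) {
    __shared__ int l_n;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) l_n = 0;
    __syncthreads();

    int d = (i < n) ? src[i] : 0;
    bool keep = d > 0;
    int pos = 0;
    if (keep) pos = atomicAdd(&l_n, 1);  // position within this block
    __syncthreads();

    // After this line l_n holds the block's base offset in dst.
    if (threadIdx.x == 0) l_n = atomicAdd(nres, l_n);
    __syncthreads();

    if (keep) dst[l_n + pos] = d;
}

using FilterKernel = void (*)(int *, int *, const int *, int);

// Hypothetical helper: alternating positive and negative values,
// so half of the elements pass the predicate.
std::vector<int> make_input(int n) {
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = (i % 2 == 0) ? 1 : -1;
    return h;
}

// Hypothetical helper: runs one warm-up launch, then times a second launch.
// Returns elapsed milliseconds and writes the number of kept elements.
float time_kernel(FilterKernel k, int *dst, int *nres, const int *src,
                  int n, int *count_out) {
    int blocks = (n + BS - 1) / BS;
    cudaMemset(nres, 0, sizeof(int));
    k<<<blocks, BS>>>(dst, nres, src, n);  // warm-up, not timed
    cudaMemset(nres, 0, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    k<<<blocks, BS>>>(dst, nres, src, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaMemcpy(count_out, nres, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;
    std::vector<int> h = make_input(n);

    int *src, *dst, *nres;
    cudaMalloc(&src, n * sizeof(int));
    cudaMalloc(&dst, n * sizeof(int));
    cudaMalloc(&nres, sizeof(int));
    cudaMemcpy(src, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int count = 0;
    float t_global = time_kernel(filter_global, dst, nres, src, n, &count);
    printf("global atomicAdd: %.3f ms, kept %d\n", t_global, count);
    float t_shared = time_kernel(filter_shared, dst, nres, src, n, &count);
    printf("shared atomicAdd: %.3f ms, kept %d\n", t_shared, count);

    cudaFree(src);
    cudaFree(dst);
    cudaFree(nres);
    return 0;
}
```

This should build with something like `nvcc -O3 filter.cu -o filter` (the file name is arbitrary). Timings from a single launch are noisy, so averaging over many launches would give a more reliable comparison.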