Warp serialization might not come from shared memory access alone. If I am not mistaken, warp-divergent branches may increase that counter too.
I am also not sure how a block of dimension 16x16 will be mapped onto warps. I remember that in some old CUDA compiler & driver versions, different values of threadIdx.y meant different warps; in that case half of all your warps would be underutilised. However, this may already have been corrected/changed. Just check it out!
My code should diverge only in very rare cases, and the profiler confirms this: it reports divergent branches separately, and there are only a very few (4 divergent branches compared to over 100000 warp serializes reported for one kernel call).
When I run the code in emulation mode and print the thread IDs, they increase as expected in row-major order (i.e. (x0,y0), … (x15,y0), (x0,y1), …, (x15,y1), …).
I think there are 16 shared memory bank conflicts per warp. Take a look: your thread blocks are 16x16, all devices with compute capability 1.x have 32 threads per warp and 16 shared memory banks, your warps are formed by 2 rows of 16 threads, and your data in shared memory is aligned to 16, so elements 16 apart (data[i], data[i+16], …) fall in the same bank. Your access pattern (btid = threadIdx.y * blockDim.x + threadIdx.x) forces threads (x0,y0) and (x0,y1), which belong to the same warp, to access data[btid] and data[btid+16] — same warp, same bank — so you've got 16 conflicts per warp for each such pattern.