In the Best Practices Guide (CUDA Toolkit 2.3), page 29,
the first paragraph below Figure 3.11 explains memory access.
In that case, each thread in a half warp accesses the same device memory address when it reads a value from matrix A.
So I thought the 16 threads would be fully serialized, causing roughly an 8x slowdown (only 4 bytes used out of each 32-byte transaction).
But the author explains:
"16 transactions for compute capability 1.1 or lower, and 1 transaction for compute capability 1.2 or higher."
I know that GPUs of compute capability 1.2 or higher can coalesce misaligned access patterns.
But in this case every thread must read the same address. Is that really accomplished with only 1 transaction on 1.2 or higher?
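To make the pattern concrete, here is a rough sketch of the kind of naive matrix-multiply loop I mean (my own illustrative code, not the guide's exact listing; kernel name and layout are assumptions):

```cuda
// Illustrative sketch: in the inner loop, the address of A[row * N + k]
// does not depend on threadIdx.x, so all 16 threads of a half warp
// request the *identical* A element in the same iteration.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y; // same for the whole half warp
    int col = blockIdx.x * blockDim.x + threadIdx.x; // varies across the half warp
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k]      // identical address for all 16 threads
             * B[k * N + col];     // coalesced: consecutive addresses
    }
    C[row * N + col] = sum;
}
```

The question is only about the `A[row * N + k]` read: 16 identical 4-byte requests that all fall inside a single 32-byte segment.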