I was surprised to see that the partition camping avoidance kernel in the CUDA Samples transpose example is significantly faster than the optimized (coalesced-and-padded) kernel.
This is on a GTX 680, with the transpose kernels recompiled to use a 32x32 tile size and a 32x8 thread-block size.
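For reference, the optimized kernel is the usual shared-memory transpose with a padded tile, and the partition-camping-avoidance variant differs mainly in remapping the block indices diagonally. A rough sketch of both (from memory, not the exact CUDA Samples source):

```cuda
#define TILE_DIM   32
#define BLOCK_ROWS  8

// Coalesced transpose with a padded shared-memory tile
// (the "optimized" kernel in the Samples).
__global__ void transposeCoalesced(float *odata, const float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 pad avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[(y + i) * width + x];

    __syncthreads();

    // Swap block indices so both the loads and the stores are coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[(y + i) * height + x] = tile[threadIdx.x][threadIdx.y + i];
}

// The partition-camping-avoidance kernel does the same tile copy,
// but first remaps block indices along diagonals so that concurrent
// blocks spread their accesses across memory partitions, e.g. for a
// square matrix:
//
//     int blockIdx_y = blockIdx.x;
//     int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
//
// and then uses blockIdx_x/blockIdx_y in place of blockIdx.x/blockIdx.y.
```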
Note that on a K20c the optimized kernel is always faster than the partition camping avoidance kernel.
So is this just a case of an access pattern that happens to foil the GTX 680's memory-partition hashing scheme? Any ideas?