8-byte shared memory bandwidth on Kepler

Does anyone have an example where they have actually achieved 8-byte shared memory bandwidth on Kepler? I have a kernel that I suspect is, at least in places, limited by shared memory bandwidth. Counting all of the shared memory reads and writes and dividing the total by the kernel execution time, I get an average of 540 GB/s on a GTX 670, which has a theoretical maximum of about 1750 GB/s. Would it be worthwhile to try 8-byte transactions? My kernel is mostly convolutions with 3 x 3 x 3 separable filters, and it seems quite similar to the “TTI Reverse Time Migration” case study in “GPU Performance Analysis and Optimization” by Paulius Micikevicius.
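To make the question concrete, here is a minimal sketch of what I mean by 8-byte transactions. It is not my actual kernel: it is a single 1D pass of a 3-tap filter in which each thread holds two adjacent floats as one float2, so every shared memory load and store is 8 bytes wide, and the host requests 8-byte banks with cudaDeviceSetSharedMemConfig. The names conv3_float2, run_demo and TILE, the tile size and the filter weights are all made up for illustration, and the input length is assumed to be even. The bank-size call is only a request (and is marked deprecated in recent toolkits), so whether this actually doubles the achieved bandwidth is exactly what I am unsure about.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 128  // float2 elements per block (hypothetical tile size)

// One 1D pass of a 3-tap filter. Each thread owns two adjacent floats,
// stored as one float2, so every shared memory access is 8 bytes wide.
// Must be launched with blockDim.x == TILE. Out-of-range inputs read as zero.
__global__ void conv3_float2(const float2* in, float2* out, int n2,
                             float w0, float w1, float w2)
{
    __shared__ float2 tile[TILE + 2];          // one halo float2 on each side

    int g = blockIdx.x * TILE + threadIdx.x;   // global float2 index
    int l = threadIdx.x + 1;                   // local index, past the left halo
    const float2 zero = make_float2(0.f, 0.f);

    tile[l] = (g < n2) ? in[g] : zero;
    if (threadIdx.x == 0)
        tile[0] = (g > 0 && g - 1 < n2) ? in[g - 1] : zero;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (g + 1 < n2) ? in[g + 1] : zero;
    __syncthreads();

    if (g < n2) {
        // Element layout: a.x a.y | b.x b.y | c.x c.y
        float2 a = tile[l - 1], b = tile[l], c = tile[l + 1];
        float2 r;
        r.x = w0 * a.y + w1 * b.x + w2 * b.y;
        r.y = w0 * b.x + w1 * b.y + w2 * c.x;
        out[g] = r;
    }
}

// Runs the kernel on n floats (n assumed even) and returns the maximum
// absolute error against a CPU reference, or -1 on a CUDA error.
float run_demo(int n)
{
    const float w0 = 0.25f, w1 = 0.5f, w2 = 0.25f;  // illustrative weights
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 17);

    // Request 8-byte shared memory banks. The return value is deliberately
    // ignored here: this is only a request, and the kernel is correct
    // whichever bank size is in effect.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = size_t(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.f; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int n2 = n / 2;                               // number of float2 elements
    int blocks = (n2 + TILE - 1) / TILE;
    conv3_float2<<<blocks, TILE>>>((const float2*)d_in, (float2*)d_out,
                                   n2, w0, w1, w2);
    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.f;

    float max_err = 0.f;
    for (int i = 0; i < n; ++i) {
        float left  = (i > 0)     ? h_in[i - 1] : 0.f;
        float right = (i + 1 < n) ? h_in[i + 1] : 0.f;
        float ref = w0 * left + w1 * h_in[i] + w2 * right;
        max_err = fmaxf(max_err, fabsf(ref - h_out[i]));
    }
    return max_err;
}

int main()
{
    float err = run_demo(1 << 20);
    printf("max abs error vs CPU reference: %g\n", err);
    return (err >= 0.f && err < 1e-5f) ? 0 : 1;
}
```

This only checks that the float2 version computes the right answer. To see whether it helps, I would time it against an otherwise identical float version and compare the shared memory counters in the profiler for the two bank-size settings.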