Transpose performance

I’ve found an odd performance effect in the transpose kernel provided with the CUDA SDK. I use the transpose in several places in an algorithm and had previously attained about 5ms execution time on image dimensions near 3072x2304x3 (three planes in sequence). However, one of them had always taken about 15ms, which I dismissed for later study. Some change I’ve since made now causes all of the transposes to take closer to 15ms.

So I explored the CUDA SDK example in more detail. I ran it on just one plane and got about 5ms (so 15ms for all three planes). However, the performance breaks down in odd ways if I comment out parts of the kernel. (Figures are the best of 20 runs on an 8800 GTX, captured by redirecting output to a file to avoid interference from text rendering.)
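
For reference, this is roughly the shared-memory transpose kernel from the SDK sample (a sketch from memory; names may differ slightly). The marked statements are the ones I commented out for the figures below:

```cuda
#define BLOCK_DIM 16

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];  // +1 avoids bank conflicts

    // read a tile of the matrix into shared memory (coalesced row-major order)
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];   // "the read"
    }

    __syncthreads();

    // write the transposed tile back to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];  // "the write"
    }
}
```

"Reads only" below means the write statement is commented out, "writes only" means the read statement is.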

No reads or writes: 0.27ms
Reads only: 1.06ms
Writes only: 4.67ms
Reads and writes (the whole kernel): 5.11ms

Why are the writes so much slower than the reads? The CUDA Visual Profiler provides some suspicious evidence.

There are 55296 coalesced loads - that makes 3072*2304/55296 = 128. This presumably measures only one multiprocessor, however, so actually it’s 3072*2304/8/55296 = the expected 16 floats.

But there are 221184 coalesced stores - 3072*2304/8/221184 = 4 floats. That accounts for the roughly 4x performance disparity between reads and writes, but why are the stores not being fully coalesced into 16-float transactions? (In fact I was not aware that 16-byte stores could be coalesced at all, so I’m not sure what’s going on here.)
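
To spell out the arithmetic (the /8 being my assumption that the profiler counters cover only an eighth of the chip, as above):

```cuda
// Host-side sanity check of the profiler numbers (a hypothetical helper, not
// part of the SDK sample). The gld/gst counts are what the Visual Profiler
// reported for one plane.
#include <stdio.h>

int main(void)
{
    const long pixels        = 3072L * 2304L;  // one plane = 7,077,888 floats
    const long gld_coalesced = 55296L;         // coalesced loads reported
    const long gst_coalesced = 221184L;        // coalesced stores reported

    // The counters seem to cover only 1/8 of the work, hence the /8.
    printf("floats per coalesced load:  %ld\n", pixels / 8 / gld_coalesced); // 16
    printf("floats per coalesced store: %ld\n", pixels / 8 / gst_coalesced); //  4
    return 0;
}
```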

Edit: This effect seems to depend on the reversed block access pattern that transposition requires, i.e. the read block order is left-to-right, top-to-bottom while the write block order is top-to-bottom, left-to-right. Making the read statement into a write does not show the adverse performance (but still exhibits the profiling oddity above). A sketch of that experiment follows.
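
For concreteness, this is my reconstruction of "making the read into a write" (odata2 is a hypothetical second output buffer; the shared-memory tile is deliberately left uninitialized since only the timing matters):

```cuda
// Variant of the kernel above with both global accesses turned into stores.
// The first store keeps the read side's left-to-right, top-to-bottom block
// order; the second keeps the transposed block order.
__global__ void transposeWritesOnly(float *odata, float *odata2,
                                    int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];  // never written; timing only

    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
        odata2[yIndex * width + xIndex] = 1.0f;  // was: the read from idata

    __syncthreads();

    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
        odata[yIndex * height + xIndex] = block[threadIdx.x][threadIdx.y];
}
```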