Transpose performance

I’ve found an odd performance effect in the transpose kernel provided with the CUDA SDK. I use the transpose in several places in an algorithm and had previously attained about 5ms execution time on image dimensions near 3072x2304x3 (three planes in sequence). However, one of them had always taken about 15ms, which I dismissed for later study. Some change I’ve since made now causes all of the transposes to take closer to 15ms.

So I explored the CUDA SDK example in more detail. I ran it on just one plane and got about 5ms (so 15ms for all three planes). However, the performance breaks down in odd ways if I comment out parts of the kernel. (Figures are the best of 20 runs on an 8800 GTX, captured by redirecting output to a file to avoid interference from text rendering.)
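
For reference, this is roughly the shared-memory transpose kernel from the SDK sample (a sketch from memory; names may differ slightly). The marked statements are the ones I commented out for the figures below:

```cuda
#define BLOCK_DIM 16

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];  // +1 avoids bank conflicts

    // read a tile of the matrix into shared memory (coalesced row-major order)
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];   // "the read"
    }

    __syncthreads();

    // write the transposed tile back to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];  // "the write"
    }
}
```

"Reads only" below means the write statement is commented out, "writes only" means the read statement is.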

No reads or writes: 0.27ms
Reads only: 1.06ms
Writes only: 4.67ms
Reads and writes (the whole kernel): 5.11ms

Why are the writes so much slower than the reads? The CUDA Visual Profiler provides some suspicious evidence.

There are 55296 coalesced loads - that makes 3072*2304/55296 = 128. This presumably measures only one multiprocessor, however, so actually it’s 3072*2304/8/55296 = the expected 16 floats.

But there are 221184 coalesced stores - 3072*2304/8/221184 = 4 floats. That accounts for the roughly 4x performance disparity between reads and writes, but why are the stores not being fully coalesced into 16-float transactions? (In fact I was not aware that 16-byte stores could be coalesced at all, so I’m not sure what’s going on here.)
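
To spell out the arithmetic (the /8 being my assumption that the profiler counters cover only an eighth of the chip, as above):

```cuda
// Host-side sanity check of the profiler numbers (a hypothetical helper, not
// part of the SDK sample). The gld/gst counts are what the Visual Profiler
// reported for one plane.
#include <stdio.h>

int main(void)
{
    const long pixels        = 3072L * 2304L;  // one plane = 7,077,888 floats
    const long gld_coalesced = 55296L;         // coalesced loads reported
    const long gst_coalesced = 221184L;        // coalesced stores reported

    // The counters seem to cover only 1/8 of the work, hence the /8.
    printf("floats per coalesced load:  %ld\n", pixels / 8 / gld_coalesced); // 16
    printf("floats per coalesced store: %ld\n", pixels / 8 / gst_coalesced); //  4
    return 0;
}
```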

Edit: This effect seems to depend on the reversed block access pattern that transposition requires, i.e. the read block order is left-to-right, top-to-bottom while the write block order is top-to-bottom, left-to-right. Making the read statement into a write does not show the adverse performance (but still exhibits the profiling oddity above). A sketch of that experiment follows.
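
For concreteness, this is my reconstruction of "making the read into a write" (odata2 is a hypothetical second output buffer; the shared-memory tile is deliberately left uninitialized since only the timing matters):

```cuda
// Variant of the kernel above with both global accesses turned into stores.
// The first store keeps the read side's left-to-right, top-to-bottom block
// order; the second keeps the transposed block order.
__global__ void transposeWritesOnly(float *odata, float *odata2,
                                    int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];  // never written; timing only

    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
        odata2[yIndex * width + xIndex] = 1.0f;  // was: the read from idata

    __syncthreads();

    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
        odata[yIndex * height + xIndex] = block[threadIdx.x][threadIdx.y];
}
```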