New transpose white paper: partition camping explained

We noticed today that we left out a very good white paper by Greg Ruetsch and Paulius Micikevicius from some versions of the new SDK release, so until we have a chance to update the SDK package, I’ve attached the white paper to this post.

It explains the new transpose sample included with 2.2, and it goes into much more detail on partition camping than anything else I’ve seen. I think many of you will find it very interesting and probably helpful for optimizing your own applications.
MatrixTranspose.pdf (1.98 MB)

Interesting read, for sure. Are there hardware counters that could be added to the profiler to detect whether one is suffering from partition camping?

Or to phrase it another way “I would like to be able to detect partition camping automatically.” ;)

Too bad I’m not bandwidth limited…

I completely agree with T.B. on that. If some counters were available in the profiler to measure and detect partition camping, that would be really nice.

I don’t see anything of this sort in the new 2.2 profiler. (Has anyone else had any luck with this?)

Also, one more thing: these optimizations still seem to be pretty much tied to the hardware.

Running transposeNew on the C870, the bandwidth is still pretty low. For example:

Matrix size: 2048x2048, tile size: 32x32, block size: 32x8


Kernel                   Loop over kernel   Loop within kernel
simple copy              53.97 GB/s         57.86 GB/s
shared memory copy       50.42 GB/s         59.81 GB/s
naive transpose           1.78 GB/s          1.38 GB/s
coalesced transpose      18.55 GB/s         35.06 GB/s
no bank conflict trans   18.82 GB/s         35.42 GB/s
coarse-grained           18.81 GB/s         35.28 GB/s
fine-grained             48.01 GB/s         59.60 GB/s
diagonal transpose       10.74 GB/s         16.36 GB/s


Can anyone else confirm this? Any explanation for why the diagonal transpose is slower?

There are no hard and fast ways to detect partition camping, which is part of the problem. I’m not sure why it’s so much slower on C870, though.

My C870 result:


Device 0: “Tesla C870”
SM Capability 1.0 detected:
CUDA device has 16 Multi-Processors
SM performance scaling factor = 1.50

Matrix size: 1536x1536 (48x48 tiles), tile size: 32x32, block size: 32x8

Kernel                   Loop over kernel   Loop within kernel
simple copy              55.17 GB/s         58.62 GB/s
shared memory copy       51.06 GB/s         56.69 GB/s
naive transpose           2.12 GB/s          2.04 GB/s
coalesced transpose      16.34 GB/s         16.86 GB/s
no bank conflict trans   16.98 GB/s         17.44 GB/s
coarse-grained           17.00 GB/s         17.65 GB/s
fine-grained             48.65 GB/s         56.39 GB/s
diagonal transpose       34.94 GB/s         55.46 GB/s

So I didn’t see that abnormal behavior; the diagonal transpose is indeed faster. However, when I run it on a GTX 295, the diagonal transpose is slower than the coalesced transpose. Does anyone else see this behavior?

BTW, given that my C870 produces a different result, I am wondering if it depends on other things like the platform, CUDA runtime version, etc. My cudart.dll is actually version 2.1.


Device 0: “GeForce GTX 295”
SM Capability 1.3 detected:
CUDA device has 30 Multi-Processors
SM performance scaling factor = 1.00

Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8

Kernel                   Loop over kernel   Loop within kernel
simple copy              89.14 GB/s         63.66 GB/s
shared memory copy       83.74 GB/s         86.36 GB/s
naive transpose           4.23 GB/s          4.21 GB/s
coalesced transpose      62.59 GB/s         69.21 GB/s
no bank conflict trans   70.85 GB/s         70.25 GB/s
coarse-grained           70.92 GB/s         70.21 GB/s
fine-grained             83.66 GB/s         86.05 GB/s
diagonal transpose       63.02 GB/s         66.17 GB/s