My C870 result:
[TransposeNew]
Device 0: “Tesla C870”
SM Capability 1.0 detected:
CUDA device has 16 Multi-Processors
SM performance scaling factor = 1.50
Matrix size: 1536x1536 (48x48 tiles), tile size: 32x32, block size: 32x8
Kernel                    Loop over kernel    Loop within kernel
simple copy               55.17 GB/s          58.62 GB/s
shared memory copy        51.06 GB/s          56.69 GB/s
naive transpose            2.12 GB/s           2.04 GB/s
coalesced transpose       16.34 GB/s          16.86 GB/s
no bank conflict trans    16.98 GB/s          17.44 GB/s
coarse-grained            17.00 GB/s          17.65 GB/s
fine-grained              48.65 GB/s          56.39 GB/s
diagonal transpose        34.94 GB/s          55.46 GB/s
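So I didn't see the abnormal behavior on the C870: the diagonal transpose is indeed much faster than the coalesced transpose there. However, when I run it on a GTX 295, the diagonal transpose is no better than the coalesced transpose (63.02 vs. 62.59 GB/s when looping over the kernel, 66.17 vs. 69.21 GB/s when looping within it) and slower than the no-bank-conflict version. Does anyone else get this behavior?
BTW, given that my C870 produces a different result, I am wondering whether this depends on other things like the platform, the CUDA runtime version, etc. My cudart.dll is actually version 2.1.
GTX 295 result: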
[TransposeNew]
Device 0: “GeForce GTX 295”
SM Capability 1.3 detected:
CUDA device has 30 Multi-Processors
SM performance scaling factor = 1.00
Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8
Kernel                    Loop over kernel    Loop within kernel
simple copy               89.14 GB/s          63.66 GB/s
shared memory copy        83.74 GB/s          86.36 GB/s
naive transpose            4.23 GB/s           4.21 GB/s
coalesced transpose       62.59 GB/s          69.21 GB/s
no bank conflict trans    70.85 GB/s          70.25 GB/s
coarse-grained            70.92 GB/s          70.21 GB/s
fine-grained              83.66 GB/s          86.05 GB/s
diagonal transpose        63.02 GB/s          66.17 GB/s
Test PASSED