I’m running the transpose example from the CUDA samples in the 6.5 Toolkit (I’ve also run the sample code from the ParallelForAll blog post ‘Efficient Matrix Transpose’) on my laptop, which has a GT 620M (Fermi) card. The ‘optimized’ implementations that use shared memory are only modestly faster than the ‘naive’ implementation, and none of them come close to the throughput of the ‘simple copy’ implementation. Even the ‘shared memory copy’ runs at less than half the speed of the simple copy. Below is my output:
Device 0: “GeForce GT 620M”
SM Capability 2.1 detected:
[GeForce GT 620M] has 2 MP(s) x 48 (Cores/MP) = 96 (Cores)
Compute performance scaling factor = 2.00
Matrix size: 512x512 (16x16 tiles), tile size: 32x32, block size: 32x8
transpose simple copy , Throughput = 4.8768 GB/s, Time = 0.40049 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 2.2575 GB/s, Time = 0.86518 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive , Throughput = 2.2181 GB/s, Time = 0.88053 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced , Throughput = 2.6276 GB/s, Time = 0.74331 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized , Throughput = 3.1283 GB/s, Time = 0.62434 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained , Throughput = 3.1307 GB/s, Time = 0.62385 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained , Throughput = 3.3956 GB/s, Time = 0.57519 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal , Throughput = 2.9128 GB/s, Time = 0.67053 ms, Size = 262144 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
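As a sanity check on the numbers above, here is a minimal sketch of how the reported throughput can be reproduced from the printed time and size. It assumes each element is read once and written once, and that "GB/s" in the output actually means GiB/s (2^30 bytes per second). Both are assumptions inferred from the figures, not taken from the sample source, and the function and constant names are made up for illustration.

```python
# Sketch: reproduce the "Throughput" column from the "Time" and "Size" columns.
# Assumptions (not confirmed from the sample source): one read plus one write
# per element, 4-byte floats, and GB/s reported in units of 2**30 bytes.

BYTES_PER_FLOAT = 4


def effective_gib_per_s(elements: int, time_ms: float) -> float:
    """Effective bandwidth in GiB/s for one read and one write per element."""
    bytes_moved = 2 * elements * BYTES_PER_FLOAT
    seconds = time_ms * 1.0e-3
    return bytes_moved / 2**30 / seconds


# Launch geometry implied by the header line of the output.
MATRIX_DIM = 512
TILE_DIM = 32
BLOCK_ROWS = 8

tiles_per_side = MATRIX_DIM // TILE_DIM                      # 16x16 tiles
threads_per_block = TILE_DIM * BLOCK_ROWS                    # matches "Workgroup"
elements_per_thread = TILE_DIM * TILE_DIM // threads_per_block

if __name__ == "__main__":
    elements = MATRIX_DIM * MATRIX_DIM
    # (kernel name, measured time in ms) copied from the output above
    for name, time_ms in [
        ("simple copy", 0.40049),
        ("naive", 0.88053),
        ("fine-grained", 0.57519),
    ]:
        print(f"{name:14s} {effective_gib_per_s(elements, time_ms):.4f} GiB/s")
```

With these assumptions the computed values agree with the printed throughput to the displayed precision, so the throughput column is just a rescaling of the measured kernel times. Note also that the matrix is only 1 MiB, so each kernel moves about 2 MiB in total.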
I’ve also run these examples on another (older) card, a GTX 550 Ti, and see similar results. I know these are older cards, but I am wondering why the shared-memory optimizations do not deliver the best throughput.
Has this been reported in another post? I could not find any such entry.
What gives? Why don’t the shared-memory implementations match the simple copy throughput?
Thanks