transpose example, SDK 3.2

Hi,

I’ve been playing a bit with the ‘transpose’ example from SDK 3.2, and I’m a bit surprised by its results.

Namely, none of the optimized versions beats the ‘simple copy’ version.

Not even ‘simple copy with shared memory’?!

My results:

> Device 0: "GeForce GT 420M"

> SM Capability 2.1 detected:

> CUDA device has 2 Multi-Processors

> SM performance scaling factor = 12.00

> MatrixSize X = 1024 is greater than the recommended size = 0

> MatrixSize Y = 1024 is greater than the recommended size = 0

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose-Outer-simple copy       , Throughput = 13.0277 GB/s, Time = 0.59968 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-simple copy       , Throughput = 31.0014 GB/s, Time = 0.25200 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-shared memory copy, Throughput = 10.7660 GB/s, Time = 0.72566 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-shared memory copy, Throughput = 17.6024 GB/s, Time = 0.44383 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-naive             , Throughput = 5.9668 GB/s, Time = 1.30932 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-naive             , Throughput = 8.8039 GB/s, Time = 0.88739 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-coalesced         , Throughput = 10.4704 GB/s, Time = 0.74615 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-coalesced         , Throughput = 14.6268 GB/s, Time = 0.53412 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-optimized         , Throughput = 11.6835 GB/s, Time = 0.66868 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-optimized         , Throughput = 20.5583 GB/s, Time = 0.38002 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-coarse-grained    , Throughput = 11.7487 GB/s, Time = 0.66497 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-coarse-grained    , Throughput = 20.5966 GB/s, Time = 0.37931 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-fine-grained      , Throughput = 11.8134 GB/s, Time = 0.66133 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-fine-grained      , Throughput = 20.4810 GB/s, Time = 0.38145 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-diagonal          , Throughput = 8.0603 GB/s, Time = 0.96925 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-diagonal          , Throughput = 24.0424 GB/s, Time = 0.32495 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

PASSED

I have to ask: is this because of the Fermi architecture?

My system: GT 420M, CUDA 3.2.

I was thinking about incorporating it into my code, but if a simple memory copy will do the trick I won’t bother with it.

Thank you,

Greg

As I remember it, a lot of the optimizations in the various transpose examples were there to work around two architectural limitations of the GT200 and earlier GPUs: “partition camping” of global memory and shared memory bank conflicts. Fermi is much, much more resistant to both, and on top of that the L1 cache of Fermi cards can make a huge difference to the performance of a lot of memory-access-pattern-limited code.
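Roughly, the optimized kernels stage each tile through padded shared memory, something like the sketch below (simplified for illustration, not the exact SDK source; the kernel name, fixed TILE_DIM and bounds checks are mine). The “+1” padding column is the classic fix for shared memory bank conflicts, and swapping the block indices on the write side is what keeps the global stores coalesced:

```
#define TILE_DIM 16

// Tiled transpose sketch: coalesced reads and writes, staged through a
// shared-memory tile. The +1 padding column avoids bank conflicts when
// the tile is read back column-wise (much less of an issue on Fermi).
// Launch with a (TILE_DIM x TILE_DIM) thread block.
__global__ void transposeTiled(float *odata, const float *idata,
                               int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int xIn = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIn = blockIdx.y * TILE_DIM + threadIdx.y;
    if (xIn < width && yIn < height)
        tile[threadIdx.y][threadIdx.x] = idata[yIn * width + xIn];

    __syncthreads();

    // Swap the block indices for the output so each warp still writes a
    // contiguous row of the transposed matrix.
    int xOut = blockIdx.y * TILE_DIM + threadIdx.x;
    int yOut = blockIdx.x * TILE_DIM + threadIdx.y;
    if (xOut < height && yOut < width)
        odata[yOut * height + xOut] = tile[threadIdx.x][threadIdx.y];
}
```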

Hi avidday,
thanks for the quick response!
Nevertheless, I’d expect at least a bit of an improvement between ‘simple copy’ and ‘simple copy with shared memory’.

Is it a valid assumption that, starting with Fermi, shared memory bank conflicts are negligible and non-coalesced global memory access doesn’t have a significant impact on overall performance?

Thank you,
Greg

That is not true in all cases, but in this test case it seems to be. I’m not sure what the L2 cache size is on the GT 420M, but keep in mind that the matrix being transposed is only 1024x1024. Assuming 4-byte elements, that’s 4 MB of data with a cache size that is either 1/10 or 1/5 of the data size. It is possible that the cache is big enough relative to the matrix size to effectively coalesce the writes for you. I’d be curious to see the results if you increase the matrix size by 10x on each dimension. (Or by less, if you don’t have that much memory.)
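To make that concrete, the naive kernel is essentially along these lines (a sketch of the idea, not the SDK source). The reads are coalesced, but consecutive threads in a warp write addresses “height” floats apart, and it is exactly those scattered writes that a large enough L2 can soak up while the whole matrix is only a few MB:

```
// Naive transpose sketch: coalesced reads, strided writes. Adjacent
// threads write locations "height" floats apart, so without a cache
// each warp's stores turn into many separate memory transactions.
__global__ void transposeNaive(float *odata, const float *idata,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // input column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // input row

    if (x < width && y < height)
        odata[x * height + y] = idata[y * width + x];
}
```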

Thanks everyone for posting and for the help.
I found what was wrong. Actually, it was just my understanding of the results that was wrong.
The first two kernels, ‘simple copy’ and ‘simple copy with shared memory’, don’t do any transposition; they are literally memory copy operations.
I’m sorry for the confusion.
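For anyone else who trips over this, the copy kernels are essentially just this (a simplified sketch, not the exact SDK code): they read and write the same linear index, so they only establish the bandwidth ceiling that the real transpose kernels are compared against.

```
// "Simple copy" sketch: output index equals input index, so nothing is
// transposed. This is purely a memory-bandwidth reference measurement.
__global__ void simpleCopy(float *odata, const float *idata,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int index = y * width + x;
        odata[index] = idata[index];
    }
}
```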

Thank you all,
Greg