Hi,
I’ve been playing a bit with the ‘transpose’ example from SDK 3.2, and I’m a bit surprised by its results.
Namely, none of the optimized kernels beats the ‘simple copy’ version.
Not even ‘simple copy with shared memory’?!
My results:
> Device 0: "GeForce GT 420M"
> SM Capability 2.1 detected:
> CUDA device has 2 Multi-Processors
> SM performance scaling factor = 12.00
> MatrixSize X = 1024 is greater than the recommended size = 0
> MatrixSize Y = 1024 is greater than the recommended size = 0
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16
transpose-Outer-simple copy , Throughput = 13.0277 GB/s, Time = 0.59968 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-simple copy , Throughput = 31.0014 GB/s, Time = 0.25200 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-shared memory copy, Throughput = 10.7660 GB/s, Time = 0.72566 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-shared memory copy, Throughput = 17.6024 GB/s, Time = 0.44383 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-naive , Throughput = 5.9668 GB/s, Time = 1.30932 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-naive , Throughput = 8.8039 GB/s, Time = 0.88739 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-coalesced , Throughput = 10.4704 GB/s, Time = 0.74615 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-coalesced , Throughput = 14.6268 GB/s, Time = 0.53412 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-optimized , Throughput = 11.6835 GB/s, Time = 0.66868 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-optimized , Throughput = 20.5583 GB/s, Time = 0.38002 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-coarse-grained , Throughput = 11.7487 GB/s, Time = 0.66497 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-coarse-grained , Throughput = 20.5966 GB/s, Time = 0.37931 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-fine-grained , Throughput = 11.8134 GB/s, Time = 0.66133 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-fine-grained , Throughput = 20.4810 GB/s, Time = 0.38145 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Outer-diagonal , Throughput = 8.0603 GB/s, Time = 0.96925 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose-Inner-diagonal , Throughput = 24.0424 GB/s, Time = 0.32495 s,
Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
PASSED
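As a sanity check on the figures above, here is a minimal sketch that recomputes the throughput column from the matrix size and the reported times. It rests on two assumptions that are mine, not taken from the SDK source: that the throughput counts one read plus one write of every element, and that the Time column is really milliseconds per kernel launch even though the log prints "s". The recomputed values only line up with the reported ones under those assumptions. The name `effective_bandwidth` is made up for this illustration.

```python
# Figures copied from the log above: (label, reported GB/s, reported Time value).
ROWS = [
    ("Outer simple copy", 13.0277, 0.59968),
    ("Inner simple copy", 31.0014, 0.25200),
    ("Inner naive", 8.8039, 0.88739),
    ("Inner diagonal", 24.0424, 0.32495),
]

ELEMENTS = 1024 * 1024   # "Size = 1048576 fp32 elements"
BYTES_PER_ELEMENT = 4    # fp32
GIB = 1024 ** 3


def effective_bandwidth(time_ms):
    """GiB/s for one read plus one write of the whole matrix.

    Assumption: time_ms is the time of a single kernel launch in milliseconds.
    """
    bytes_moved = 2 * ELEMENTS * BYTES_PER_ELEMENT
    return bytes_moved / GIB / (time_ms / 1000.0)


for label, reported, time_ms in ROWS:
    recomputed = effective_bandwidth(time_ms)
    print(f"{label:20s} reported {reported:8.4f}  recomputed {recomputed:8.4f}")
```

If those assumptions hold, the throughput column is just a rescaled inverse of the time column, so comparing either one across kernels tells the same story.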
I need to ask: is this because of the Fermi architecture?
My system: GT 420M, CUDA 3.2.
I was thinking about incorporating the optimized transpose into my code, but if a simple memcopy will do the trick I won’t bother with it.
Thank you,
Greg