transpose example, SDK 3.2

Hi,

I’ve been playing a bit with the ‘transpose’ example from SDK 3.2, and I’m a bit surprised by its results.

Namely, none of the optimized versions beats the ‘simple copy’ version.

Not even ‘simple copy with shared memory’?!

My results:

> Device 0: "GeForce GT 420M"

> SM Capability 2.1 detected:

> CUDA device has 2 Multi-Processors

> SM performance scaling factor = 12.00

> MatrixSize X = 1024 is greater than the recommended size = 0

> MatrixSize Y = 1024 is greater than the recommended size = 0

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose-Outer-simple copy       , Throughput = 13.0277 GB/s, Time = 0.59968 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-simple copy       , Throughput = 31.0014 GB/s, Time = 0.25200 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-shared memory copy, Throughput = 10.7660 GB/s, Time = 0.72566 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-shared memory copy, Throughput = 17.6024 GB/s, Time = 0.44383 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-naive             , Throughput = 5.9668 GB/s, Time = 1.30932 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-naive             , Throughput = 8.8039 GB/s, Time = 0.88739 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-coalesced         , Throughput = 10.4704 GB/s, Time = 0.74615 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-coalesced         , Throughput = 14.6268 GB/s, Time = 0.53412 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-optimized         , Throughput = 11.6835 GB/s, Time = 0.66868 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-optimized         , Throughput = 20.5583 GB/s, Time = 0.38002 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-coarse-grained    , Throughput = 11.7487 GB/s, Time = 0.66497 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-coarse-grained    , Throughput = 20.5966 GB/s, Time = 0.37931 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-fine-grained      , Throughput = 11.8134 GB/s, Time = 0.66133 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-fine-grained      , Throughput = 20.4810 GB/s, Time = 0.38145 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Outer-diagonal          , Throughput = 8.0603 GB/s, Time = 0.96925 s,

Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose-Inner-diagonal          , Throughput = 24.0424 GB/s, Time = 0.32495 s,

 Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

PASSED

I have to ask: is this because of the Fermi architecture?

My system: GT 420M, CUDA 3.2.

I was thinking about incorporating it into my code, but if a simple memory copy will do the trick I won’t bother with it.

Thank you,

Greg

As I remember it, a lot of the optimizations in the various transpose examples were there to work around two architectural limitations of the GT200 and earlier GPUs: “partition camping” of global memory and shared memory bank conflicts. Fermi is much, much more resistant to both, and on top of that the L1 cache of Fermi cards can make a huge difference to the performance of a lot of memory-access-pattern-limited code.
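Roughly, the optimized kernels stage each tile through padded shared memory, something like the sketch below (simplified for illustration, not the exact SDK source; the kernel name, fixed TILE_DIM and bounds checks are mine). The “+1” padding column is the classic fix for shared memory bank conflicts, and swapping the block indices on the write side is what keeps the global stores coalesced:

```
#define TILE_DIM 16

// Tiled transpose sketch: coalesced reads and writes, staged through a
// shared-memory tile. The +1 padding column avoids bank conflicts when
// the tile is read back column-wise (much less of an issue on Fermi).
// Launch with a (TILE_DIM x TILE_DIM) thread block.
__global__ void transposeTiled(float *odata, const float *idata,
                               int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int xIn = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIn = blockIdx.y * TILE_DIM + threadIdx.y;
    if (xIn < width && yIn < height)
        tile[threadIdx.y][threadIdx.x] = idata[yIn * width + xIn];

    __syncthreads();

    // Swap the block indices for the output so each warp still writes a
    // contiguous row of the transposed matrix.
    int xOut = blockIdx.y * TILE_DIM + threadIdx.x;
    int yOut = blockIdx.x * TILE_DIM + threadIdx.y;
    if (xOut < height && yOut < width)
        odata[yOut * height + xOut] = tile[threadIdx.x][threadIdx.y];
}
```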

Hi avidday,
thanks for the quick response!
Nevertheless, I’d expect at least a bit of an improvement between ‘simple copy’ and ‘simple copy with shared memory’.

Is it a valid assumption that, starting with Fermi, shared memory bank conflicts are negligible and non-coalesced global memory access doesn’t have a significant impact on overall performance?

Thank you,
Greg

That is not true in all cases, but in this test case it seems to be. I’m not sure what the L2 cache size is on the GT 420M, but keep in mind that the matrix being transposed is only 1024x1024. Assuming 4-byte elements, that’s 4 MB of data with a cache size that is either 1/10 or 1/5 of the data size. It is possible that the cache is big enough relative to the matrix size to effectively coalesce the writes for you. I’d be curious to see the results if you increase the matrix size by 10x on each dimension. (Or by less, if you don’t have that much memory.)
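To make that concrete, the naive kernel is essentially along these lines (a sketch of the idea, not the SDK source). The reads are coalesced, but consecutive threads in a warp write addresses “height” floats apart, and it is exactly those scattered writes that a large enough L2 can soak up while the whole matrix is only a few MB:

```
// Naive transpose sketch: coalesced reads, strided writes. Adjacent
// threads write locations "height" floats apart, so without a cache
// each warp's stores turn into many separate memory transactions.
__global__ void transposeNaive(float *odata, const float *idata,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // input column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // input row

    if (x < width && y < height)
        odata[x * height + y] = idata[y * width + x];
}
```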

Thanks everyone for posting and for the help.
I found what was wrong. Actually, it was just my understanding of the results that was wrong.
The first two kernels, ‘simple copy’ and ‘simple copy with shared memory’, don’t do any transposition; they are literally memory copy operations.
I’m sorry for the confusion.
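For anyone else who trips over this, the copy kernels are essentially just this (a simplified sketch, not the exact SDK code): they read and write the same linear index, so they only establish the bandwidth ceiling that the real transpose kernels are compared against.

```
// "Simple copy" sketch: output index equals input index, so nothing is
// transposed. This is purely a memory-bandwidth reference measurement.
__global__ void simpleCopy(float *odata, const float *idata,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int index = y * width + x;
        odata[index] = idata[index];
    }
}
```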

Thank you all,
Greg