Why is the matrix transpose much faster when it uses shared memory?

I read the transpose sample in the CUDA SDK and have a question about its speed.

The sample transposes a matrix from one global memory buffer to another. It shows that the program is much faster if the data is first staged in shared memory and then written to the destination buffer in global memory.

Why does this happen?
Can you explain it to me?

Thank you

My guess is that they achieve coalesced access both when they first load the data into shared memory and when they later write it back, which is hard to achieve when transposing the matrix directly in device memory: there, either the reads or the writes end up strided. As long as bank conflicts are avoided, shared memory is roughly as fast as registers.

Yes, that is correct.
By staging the data in shared memory, you can achieve coalesced memory access both to and from global memory.
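
To make the point about coalescing concrete, here is a minimal sketch of the two approaches. It is not the SDK sample's actual code: the names transposeNaive, transposeTiled and runTransposeCheck, the 32x32 tile size, and the square row-major matrix are all assumptions made for illustration. In the naive kernel, neighbouring threads read neighbouring addresses but write addresses n floats apart. In the tiled kernel, the reordering happens inside shared memory, so both the global read and the global write walk consecutive addresses.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile edge; one warp covers one tile row

// Naive version: reads are coalesced, but consecutive threads write
// addresses that are n floats apart, so the writes are strided.
__global__ void transposeNaive(const float* in, float* out, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) out[x * n + y] = in[y * n + x];
}

// Tiled version: stage a TILE x TILE block in shared memory so that both
// the global read and the global write touch consecutive addresses.
__global__ void transposeTiled(const float* in, float* out, int n) {
    // The extra padding column keeps column-wise reads of the tile
    // from all landing in the same shared memory bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices; threadIdx.x still indexes the fastest dimension.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: runs both kernels on an n x n matrix and compares
// each result against a transpose computed on the host.
bool runTransposeCheck(int n) {
    size_t count = size_t(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_out(count);
    for (size_t i = 0; i < count; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);

    bool ok = true;
    for (int pass = 0; pass < 2 && ok; ++pass) {
        cudaMemset(d_out, 0, bytes);
        if (pass == 0) transposeNaive<<<grid, block>>>(d_in, d_out, n);
        else           transposeTiled<<<grid, block>>>(d_in, d_out, n);
        // cudaMemcpy waits for the kernel and reports any launch failure.
        ok = (cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess);
        for (int r = 0; r < n && ok; ++r)
            for (int c = 0; c < n && ok; ++c)
                ok = (h_out[size_t(r) * n + c] == h_in[size_t(c) * n + r]);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    bool ok = runTransposeCheck(1000);  // size is not a multiple of TILE
    std::printf("%s\n", ok ? "transpose OK" : "transpose FAILED");
    return ok ? 0 : 1;
}
```

This sketch only checks correctness. To see the speed difference, you would time each kernel (for example with CUDA events) on a large matrix. A 32x32 thread block is used here for simplicity; other block shapes are possible and may perform better.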

Thanks.