I read the demo about transpose in CUDA SDK, and have a question about the speed.
The demo transposes a matrix from one region of global memory to another. It shows that if each tile of the matrix is first staged in shared memory and then written out to the destination in global memory, the kernel runs much faster than copying directly.
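For reference, here is roughly what I understand the two kernels to look like (my own sketch of the idea, not the actual SDK source; `TILE_DIM` and the kernel names are my own):

```cuda
#define TILE_DIM 32

// Naive version: the global reads are coalesced, but the writes to
// odata are strided by `height`, so consecutive threads write to
// widely separated addresses (uncoalesced stores).
__global__ void transposeNaive(float *odata, const float *idata,
                               int width, int height)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        odata[x * height + y] = idata[y * width + x];
}

// Shared-memory version: a tile is staged in shared memory, and the
// transpose happens while reading the tile back out, so both the global
// load and the global store are coalesced (consecutive threads touch
// consecutive addresses). The +1 padding avoids shared-memory bank
// conflicts when reading the tile column-wise.
__global__ void transposeShared(float *odata, const float *idata,
                                int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();

    // Swap the block indices for the output tile; threads still write
    // row-by-row, which keeps the global store coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```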
Why does this happen?
Can you explain it to me?