Transpose kernel slower on GTX280 vs 8800GTX?

I upgraded a while ago and noticed that the transpose kernel (the one from the SDK) takes 400us on my GTX280 for a 1024x1024 matrix transpose, as opposed to 300us on an 8800GTX. Why is that?

See slide 36 of:
http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf

This document is from the end of 2007, and the author also reports 300us for the transpose.

Can somebody confirm that? Or is there something wrong with my card? All my other kernels got a 1.5x-2x speedup from the upgrade so I’m kinda confused…

My application uses transposes a lot (2D RC transformations) so a fast transpose kernel is crucial.

Thanks

Try this transpose, see how perf is affected:

http://forums.nvidia.com/index.php?showtopic=66766&hl=

Don’t be too surprised. The memory subsystem changed so that the ratio of memory channels to multiprocessors is no longer 1:1. This is not a problem in and of itself, but code that previously divided itself up perfectly across memory channels may now find its access pattern unbalanced. Memory channels and interleaving are perhaps the biggest element of the GPU architecture that people don’t know about.

Humm,

that isn’t really satisfying. I used the transpose kernel from the link above and got some speedup. I tried to tweak it several times but didn’t get any faster than 330us, which is still slower than the 300us I got with the 8800GTX… I wonder how the GTX260, 9800 etc. compare… Anybody want to run the SDK sample for me?

EDIT:
So I just pulled out my GTX280 and plugged in my old 8800GTX, and voilà, my runtime is down to 180us with the optimized kernel… That’s a pretty big difference…