Possible reasons for slower performance with 2.0 over 1.3?

Hello everyone,
I am writing an application that does audio processing on the GPU. I have two versions: one processes in real time, playing the audio after it finishes working on a section, and the other computes the results "offline" and then saves a wave file. I am testing on two machines, one with a GeForce GTX 260 (compute capability 1.3) and one with a Tesla C2050 (compute capability 2.0, with much better specs than the GTX 260, such as 1.4 GHz vs. 626 MHz). I compile with -gencode arch=compute_13,code=sm_13 and -gencode arch=compute_20,code=sm_20 for the respective machines.

With the real-time version, the C2050 does incredibly well, processing up to 13.5 seconds of audio in real time, while the GTX 260 manages only about 3.9 seconds. However, with the offline version, the GTX 260 is a full second faster, finishing in about 7 seconds while the C2050 takes about 8 to process everything.

Especially with the new cache of the 2.0 architecture, I expected a pretty hefty speedup, since my code should benefit heavily from hardware caching: every iteration it accesses all but one of the same elements as the previous iteration. Think of it as a threads-per-block sized array "sliding" over another array, moving one element per iteration. Are there certain situations in which the 1.3 architecture does better? I use shared memory and texture memory pretty heavily in my code (the real-time version doesn't use as much texture memory, but still uses it). Could that be the reason for the slowdown?
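Roughly, the access pattern looks like this (a simplified sketch with made-up names and stand-in per-sample math, not my actual kernel; the point is just that the shared-memory window slides by one element per iteration):

// Illustrative sketch only: each block owns a range of output samples and
// slides a blockDim.x-sized window over the input, one element per iteration,
// so each iteration re-reads all but one of the elements the previous one used.
__global__ void slidingWindow(const float *input, float *output,
                              int samplesPerBlock, int numSamples)
{
    extern __shared__ float window[];                 // blockDim.x floats
    int first = blockIdx.x * samplesPerBlock;
    int last  = min(first + samplesPerBlock, numSamples - (int)blockDim.x);

    for (int start = first; start < last; ++start)    // window slides by one
    {
        window[threadIdx.x] = input[start + threadIdx.x];   // mostly re-read data
        __syncthreads();

        if (threadIdx.x == 0)                          // stand-in for the real work
        {
            float acc = 0.0f;
            for (int i = 0; i < blockDim.x; ++i)
                acc += window[i];
            output[start] = acc;
        }
        __syncthreads();
    }
}

The launch is along the lines of slidingWindow<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_in, d_out, samplesPerBlock, numSamples);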

Thanks.

Does anyone have any ideas? If not, I'll let this thread die, but I wanted to bump it in case anyone missed it the first time around.

There might be shared memory bank conflicts now where there were none before: shared memory accesses are serviced per warp on Fermi rather than per half-warp as on compute 1.x hardware, and there are 32 banks instead of 16, so an access pattern that was conflict-free on the GTX 260 can serialize on the C2050.
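A quick way to check is the profiler's shared-memory bank conflict counter on each card. If conflicts show up, the usual fix for strided shared-memory access is padding the array so threads of a warp that index "down a column" hit different banks. A generic sketch (the classic padded transpose tile, not tied to your kernel, assuming a square width x width matrix and a 32x32 thread block):

#define TILE 32

// Threads of a warp index the first dimension of the tile when writing the
// transposed output; with a 32-float row pitch every such access would land
// in the same bank on Fermi's 32 banks (a 32-way conflict). Padding the row
// to 33 floats spreads those accesses across different banks.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];                  // +1 padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // strided tile read
}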

You might also want to check the Fermi Tuning Guide to see whether it lists any other reasons for potential slowdowns.
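One Fermi-specific knob the guide covers is the configurable on-chip memory split: if I remember right, the default on 2.0 devices is 48 KB shared memory and 16 KB L1 per multiprocessor. Since you expect your access pattern to cache well, it may be worth experimenting with the larger L1 for that kernel. A minimal sketch, with myKernel standing in for your kernel name:

// Request 48 KB L1 / 16 KB shared memory for this kernel
// (only has an effect on compute capability 2.x devices).
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);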