I’ve got three cards (M1060, M2050, GTX480) and I’m running benchmarks on all of them. The benchmark is a simple matrix multiply, essentially the same as the one in the SDK samples, using multiple streams. ECC is off on the M2050. In computation time alone I’m seeing about a 30% improvement going from the M2050 to the GTX480, which is more than I’d expect from a 12% increase in memory bandwidth and one additional multiprocessor, but maybe that really is on par?
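As a back-of-envelope check on whether 30% is plausible, the GTX480 also runs a noticeably higher shader clock than the M2050 (roughly 1.40 GHz vs. 1.15 GHz per the published specs; I haven’t re-verified the clocks on my actual cards). Combining SM count and clock gives a raw compute-throughput ratio of about:

$$
\frac{15 \times 1401\,\text{MHz}}{14 \times 1147\,\text{MHz}} \approx 1.31
$$

so if the kernel is compute-bound rather than bandwidth-bound, ~30% would be roughly what the specs predict.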
What’s bothering me more is that as I scale up the number of streams, the M2050 overtakes the GTX480. This timing covers the whole pipeline: all streams copying to the device, computing, and copying back. On computation alone the GTX480 is always faster, as expected, and the host transfer times are similar, so neither of those alone explains it. It seems strictly related to the overlapping of multiple streams. But why isn’t the GTX480 seeing the same gains? The two cards have the same compute capability and, as far as I know, a very similar architecture, and the executable I’m running is identical on both platforms.
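For reference, the per-stream pipeline follows the standard copy/compute/copy-back pattern. A minimal sketch of what I mean (illustrative only: the kernel here is a trivial stand-in for the real matmul, and the sizes and names are not from my actual benchmark; pinned host memory via `cudaMallocHost` is required for the async copies to overlap at all):

```cuda
// Sketch of the multi-stream pipeline being timed (not the exact benchmark code).
#include <cuda_runtime.h>
#include <stdio.h>

#define NSTREAMS 4
#define N (1 << 20)   // elements per stream; illustrative size

// Stand-in for the real matrix-multiply kernel.
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    float *h, *d;
    cudaStream_t s[NSTREAMS];

    // Pinned host buffer: cudaMemcpyAsync only overlaps with pageable-free memory.
    cudaMallocHost(&h, (size_t)NSTREAMS * N * sizeof(float));
    cudaMalloc(&d, (size_t)NSTREAMS * N * sizeof(float));
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    // Each stream gets its own slice: H2D copy, kernel, D2H copy.
    for (int i = 0; i < NSTREAMS; ++i) {
        size_t off = (size_t)i * N;
        cudaMemcpyAsync(d + off, h + off, N * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        work<<<(N + 255) / 256, 256, 0, s[i]>>>(d + off, N);
        cudaMemcpyAsync(h + off, d + off, N * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();   // timing stops here in the full-pipeline measurement

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

The timing I quoted for the full pipeline spans from the first `cudaMemcpyAsync` issue to the `cudaDeviceSynchronize` at the end.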