I’ve got 3 cards (M1060, M2050, GTX480) and I’m running benchmarks on all of them. I’m running a simple matrix multiply, basically the same as provided in the samples, making use of multiple streams. ECC is off on the M2050. I’m seeing about a 30% performance increase from the M2050 → GTX480 in just computation time, which is more than I’d expect for a 12% increase in memory bandwidth and 1 additional MP, but maybe that’s really on par?
What’s bothering me more is that as I scale up the number of streams, the M2050 overtakes the GTX480. I’m timing from the start of all streams copying to device, through computing, to copying back. On computation alone the GTX480 is always faster, as expected, and the host transfer times are similar, so that alone isn’t it. So it seems strictly related to the overlapping of multiple streams. But why isn’t the GTX480 seeing the same gains? It’s the same compute capability and, as far as I know, a very similar architecture. The executable I’m running is identical between the platforms.
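For reference, the per-stream pipeline I’m timing looks roughly like this (kernel name, sizes, and buffer setup are placeholders, not the exact benchmark code):

```cuda
// Sketch of the overlap pattern being benchmarked: for each stream, issue an
// asynchronous H2D copy, a kernel launch, and an asynchronous D2H copy, then
// wait for everything. matrixMulKernel, n, and nStreams are stand-ins.
#include <cuda_runtime.h>

__global__ void matrixMulKernel(const float *A, const float *B, float *C, int n);

void runStreamed(float *hA[], float *hB[], float *hC[],
                 float *dA[], float *dB[], float *dC[],
                 int n, int nStreams, cudaStream_t streams[])
{
    size_t bytes = (size_t)n * n * sizeof(float);
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    for (int s = 0; s < nStreams; ++s) {
        // Host buffers must be pinned (cudaHostAlloc / cudaMallocHost) or
        // these copies will not actually overlap with computation.
        cudaMemcpyAsync(dA[s], hA[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB[s], hB[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        matrixMulKernel<<<grid, block, 0, streams[s]>>>(dA[s], dB[s], dC[s], n);
        cudaMemcpyAsync(hC[s], dC[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // timing stops once all streams have drained
}
```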
The GTX 480 has both an additional MP relative to the M2050 (15 vs. 14), and the shader clock rate is boosted from 1.15 GHz to 1.4 GHz. That adds up to roughly a 30% improvement (15/14 × 1.4/1.15 ≈ 1.30) if you are completely compute bound.
My guess (based on a post on the official NVIDIA forums) is that you are seeing the benefit of the extra DMA engine on Tesla. The GeForce can overlap a single device-to-host or host-to-device transfer with computation on different streams, but the M2050 has two DMA engines, so it can perform transfers in both directions while running calculations.
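You can check this at runtime by querying the copy-engine count for each device; `cudaDeviceProp::asyncEngineCount` requires CUDA 4.0 or later (older toolkits only expose the boolean `deviceOverlap`). A minimal sketch:

```cuda
// Print the number of copy (DMA) engines per device. Tesla boards such as the
// M2050 should report 2 (concurrent H2D + D2H transfers); GeForce boards such
// as the GTX 480 should report 1. Requires CUDA 4.0+ for asyncEngineCount.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("%s: %d copy engine(s)\n", prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```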
Completely forgot about that extra DMA engine! That is definitely it, thanks! And thanks for the info about the shader clock rate, between the two it sounds like the difference I noted. Appreciate your insight.