I just ran the SDK examples on a C2050 to see how it compares to the C1060, and at first I didn't see much difference.
I noticed the device-to-device memory bandwidth was comparable: 73 GB/s for the C1060 and 78 GB/s for the C2050. Then I saw the monitor was attached to the Tesla, which I thought might be loading the GPU significantly. I disabled that monitor in Windows and made a Quadro 290 the main display, but I can't say that improved performance: memory bandwidth went to 79 GB/s (possibly just noise, since monitor refresh consumes very little), and compute performance for things like convolutionFFT2D didn't improve.
Then I saw you can disable memory error correction (ECC), and with it off the bandwidth went up to 90 GB/s, along with compute performance for most applications.
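For what it's worth, here's a quick back-of-the-envelope estimate of the ECC bandwidth penalty implied by the numbers above (pure arithmetic from the figures I reported, no GPU needed; the variable names are just mine):

```python
# ECC overhead estimate from the bandwidthTest figures reported above.
bw_ecc_on = 78.0   # GB/s, C2050 device-to-device, ECC enabled
bw_ecc_off = 90.0  # GB/s, C2050 device-to-device, ECC disabled

# Fraction of peak device-to-device bandwidth lost to ECC
overhead = 1.0 - bw_ecc_on / bw_ecc_off
print(f"ECC bandwidth penalty: {overhead:.0%}")  # -> ECC bandwidth penalty: 13%
```

So on my card, ECC seems to cost roughly 13% of device-to-device bandwidth, which is in the same ballpark as the compute improvement I saw after turning it off.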
I still want to know: how significant is the impact of driving a display from the Tesla C2050 on CUDA performance?