C2070 VS. K20

I am using a GPU for Monte Carlo analysis. The samples are independent, so I use one GPU thread per sample.
In my application, one C2070 is about 2 to 4 times faster than two E5606 CPUs (8 cores in total) when the GPU is fully occupied (6k samples). That is not enough data to make a further decision, so I would like to know what to expect if I switch to a K20 (or K20X?) and run it fully occupied (20k samples for the K20?).
Does anyone have experience with this?

It’s really hard to say if you will see an improvement. If it is feasible, I would take a look at the GPU Test Drive program that NVIDIA offers:


Keep in mind that you will definitely need to try different block configurations to maximize performance. The block configurations you use for the C2070 will almost certainly give you lower performance on Kepler. (This is probably the #1 source of performance complaints when people switch.)
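To illustrate why a block size tuned for Fermi may not carry over, here is a rough Python sketch that estimates occupancy from per-multiprocessor thread limits alone. The limits are the documented values for compute capability 2.0 (Fermi) and 3.5 (Kepler GK110); the function name `warp_occupancy` and the dictionary layout are made up for this example. It deliberately ignores register and shared-memory usage, which often dominate in practice, so treat it as a starting point for experiments rather than a prediction.

```python
import math

WARP_SIZE = 32

# Per-multiprocessor limits (thread-related only).
# Register and shared-memory limits are ignored in this sketch.
LIMITS = {
    "fermi": {"max_warps": 48, "max_blocks": 8},    # cc 2.0, e.g. C2070
    "kepler": {"max_warps": 64, "max_blocks": 16},  # cc 3.5, e.g. K20 / K20X
}


def warp_occupancy(block_size, arch):
    """Fraction of the multiprocessor's warp slots filled by resident blocks."""
    if not 1 <= block_size <= 1024:
        raise ValueError("block_size must be between 1 and 1024")
    lim = LIMITS[arch]
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    # Resident blocks are capped by both the block limit and the warp limit.
    resident_blocks = min(lim["max_blocks"], lim["max_warps"] // warps_per_block)
    return resident_blocks * warps_per_block / lim["max_warps"]


if __name__ == "__main__":
    for block_size in (64, 128, 192, 256, 512):
        fermi = warp_occupancy(block_size, "fermi")
        kepler = warp_occupancy(block_size, "kepler")
        print(f"{block_size:4d} threads/block: Fermi {fermi:.0%}, Kepler {kepler:.0%}")
```

Under these limits, 192 threads per block fills a Fermi multiprocessor completely but leaves some Kepler warp slots empty, while 128 threads per block does the opposite. Occupancy is only one factor, so timing the real kernel over several block sizes is still the way to decide.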

Hi Seibert,
If I run a large number of samples (say 20k), could the K20X be 6x faster than the C2070 when both run at their best configurations? The K20X has 6x as many SPs (2688) as the C2070 (448). Or have I misunderstood what an SP is?

I do not think getting a 6x speed-up is that simple. The cores in the K20 are different from the cores in the C2070. Based on the theoretical GFLOPS, I would expect a 2x speed-up.

Hi Pasoleatis,
Thanks! Several months ago I had a chance to do some testing on a K20 and found that my application ran about 1.1 to 1.7 times faster than on the C2070. That did not meet my expectation (at the time I thought it should be about 6x, based on the difference in SP count), but I did not have time to look into it. Now the result seems reasonable.

Aside from all the other architectural differences between Fermi and Kepler, it is important to realize that the clock rate of the CUDA cores is significantly lower on Kepler than on Fermi. In order to increase power efficiency, NVIDIA increased the number of CUDA cores and decreased the clock rate, which gives you the ~2x performance difference while staying in the same power envelope.
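The cores-times-clock argument can be checked with a little arithmetic. The Python sketch below computes theoretical peak GFLOPS as cores x clock x 2 (one fused multiply-add per core per cycle). The 448 and 2688 core counts come from this thread. The K20 core count, the clock rates, and the double-precision throughput ratios are figures recalled from published spec sheets, so treat them as assumptions and verify them against NVIDIA's data sheets. The names `GPUS`, `peak_gflops`, and `speedup_vs_c2070` are made up for this example.

```python
# Assumed spec-sheet figures (verify before relying on them).
# dp_ratio is double-precision throughput relative to single precision.
GPUS = {
    "C2070": {"cores": 448, "clock_ghz": 1.15, "dp_ratio": 1 / 2},
    "K20": {"cores": 2496, "clock_ghz": 0.706, "dp_ratio": 1 / 3},
    "K20X": {"cores": 2688, "clock_ghz": 0.732, "dp_ratio": 1 / 3},
}


def peak_gflops(name, double_precision=False):
    """Theoretical peak: cores * clock (GHz) * 2 FLOPs per fused multiply-add."""
    gpu = GPUS[name]
    single = gpu["cores"] * gpu["clock_ghz"] * 2
    return single * gpu["dp_ratio"] if double_precision else single


def speedup_vs_c2070(name, double_precision=False):
    """Ratio of theoretical peaks, not a measured speed-up."""
    return peak_gflops(name, double_precision) / peak_gflops("C2070", double_precision)


if __name__ == "__main__":
    for name in ("K20", "K20X"):
        cores = GPUS[name]["cores"] / GPUS["C2070"]["cores"]
        sp = speedup_vs_c2070(name)
        dp = speedup_vs_c2070(name, double_precision=True)
        print(f"{name}: {cores:.1f}x cores, {sp:.1f}x peak SP, {dp:.1f}x peak DP")
```

With these assumed figures, the 6x core count of the K20X shrinks to well under 4x in single-precision peak and to a little over 2x in double precision, which is consistent with the ~2x estimate above. These are upper bounds on arithmetic throughput only; a kernel limited by memory bandwidth or latency can see much less.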