Performance Difference between GTX 770 and Tesla K20m


I’ve been porting a CPU application to CUDA (very similar to Conway’s Game of Life, but with more complex neighbour checks involving double-precision calculations), and have been testing on a GTX 770 (the sole graphics card on a Win7 machine, so WDDM is enabled; driver version 378.49) and a Tesla K20m (one of 4 cards in a Ubuntu 16.04.2 machine; driver version 381.09).

The 770 has been outperforming the K20m in all of my tests, even in double precision (where the K20m should be ~8x faster, according to Wikipedia and the nbody sample in the extras folder run with -benchmark -fp64). The 770 is newer by a couple of months, but both are Kepler devices. On paper they seem very close in performance (what the 770 lacks in cores, it makes up for in clock speed), except that double precision is supposed to be much faster on the K20m. I also compiled against compute_35 for the K20m and compute_30 for the 770 (the maximum for each, from what I can tell).

I wrote a simple kernel to test this:

__global__ void doubleMult(int n, double *x, double *y)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	for (int i = index; i < n; i += stride)
		y[i] = x[i] * y[i];
}

I run this on a vector of 2^25 elements, and the kernel takes about 4.5 ms on the 770 vs 5.4 ms on the K20m according to nvprof (average of 10 runs). Even changing the operation to an FMA with

y[i] = x[i] * y[i] + x[i];

exhibits no change in performance.

A single-precision version of the above kernel runs in 2.51 ms (770) vs 2.98 ms (K20m) (again a 10-run average). In both single and double precision, the 770 is roughly 1.18x faster than the K20m.

Is there something I’m doing incorrectly here? How do I take full advantage of the K20m?


This doesn’t look like a compute bound code. What’s the ratio of memory bandwidth?

nvprof reports that the doubleMult kernel only accounts for 1% of the profiled time (63.3% for memcpy DtoH and 35.7% for memcpy HtoD). I figured that since I was just trying to get a measure of double-precision performance (and was only measuring the kernel time, not any memory operation time), that wouldn’t have mattered so much.

In the application I’m porting, I’ve used pinned memory. The kernel takes 92.1% of the profiling time there with 6.6% for memcpy DtoH and 0.01% for memcpy HtoD (and 1.2% for another set up kernel).

GTX 770 has a peak theoretical memory bandwidth of 224 GB/s.

Tesla K20m has a peak theoretical memory bandwidth of 208 GB/s.

So for memory-bound work, the GTX 770 should be faster, and the ratio may be even higher than 224/208 (≈ 1.08) because the K20m supports ECC; if ECC is on in your case, the K20m will have even less memory bandwidth available. Your ratio of 1.18 is quite plausible.

Your code, and specifically your kernel, is memory bound. Therefore the ratio of your kernel timings is approximately equal to the ratio of memory bandwidths.

Just to clarify, memory bandwidth on a GPU refers to the bandwidth from VRAM to a core/SM/etc, not from system RAM, yes?

Ok, that seems to make sense. So I’d need many more calculations with the memory I’m accessing before I’d get to a compute bound kernel rather than a memory bound kernel.


Yes. You can make a “proxy” measurement of it with the CUDA bandwidthTest sample code (the reported device-to-device bandwidth number). Then take the ratio between the two GPUs. If you have ECC enabled, this is probably a better estimate of the actual ratio than my use of published theoretical maximums.

Yes. The canonical “more calculations” code is a matrix-matrix multiply, such as a large CUBLAS gemm call. You should see the K20 pull ahead of the GTX 770 if you compared Sgemm between the two devices. And you’d see the K20 blow the GTX 770’s doors off if you compared Dgemm between the two devices.