Performance Difference between GTX 770 and Tesla K20m

Hi,

I’ve been porting a CPU application to CUDA (very similar to Conway’s Game of Life, but with more complex neighbour checks involving double-precision calculations), and have been testing on a GTX 770 (the sole graphics card in a Win7 machine, so WDDM is enabled; driver version 378.49) and a Tesla K20m (one of four cards in an Ubuntu 16.04.2 machine; driver version 381.09).

The 770 has been outperforming the K20m in all of my tests, even when using double precision (where the K20m should be ~8x faster, according to Wikipedia and the nbody tool in the extras folder with -benchmark -fp64). The 770 is newer by a couple of months, but both are Kepler devices, and on paper they seem very close in performance (what the 770 lacks in cores, it makes up for in clock speed), except that double precision is supposed to be much faster on the K20m. I also compiled against compute_35 for the K20m and compute_30 for the 770 (the maximum for each, from what I can tell).

I wrote a simple kernel to test this:

// Grid-stride loop: each thread handles elements index, index+stride, index+2*stride, ...
__global__ void doubleMult(int n, double *x, double *y)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	for(int i = index; i < n; i+=stride)
		y[i] = x[i] * y[i];
}
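
For reference, a minimal host harness for timing a kernel like this might look as follows (the block size, the cudaEvent-based timing, and skipping real initialization are illustrative choices, not necessarily the exact harness used here):

#include <cstdio>
#include <cuda_runtime.h>

// doubleMult as defined above is assumed to be in the same file.

int main()
{
    const int n = 1 << 25;                       // 2^25 elements, as in the test below
    double *x, *y;
    cudaMalloc((void**)&x, n * sizeof(double));
    cudaMalloc((void**)&y, n * sizeof(double));
    // (a real test would initialize x and y via cudaMemcpy or an init kernel)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int block = 256;
    const int grid  = (n + block - 1) / block;

    cudaEventRecord(start);
    doubleMult<<<grid, block>>>(n, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("doubleMult kernel time: %.3f ms\n", ms);

    cudaFree(x);
    cudaFree(y);
    return 0;
}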

I run this on a vector of 2^25 elements, and the kernel takes about 4.5 ms on the 770 vs 5.4 ms on the K20m according to nvprof (average of 10 runs). Even changing the operation to an FMA with

y[i] = x[i] * y[i] + x[i];

produces no change in performance.

A single-precision version of the above kernel runs in 2.51 ms (770) vs 2.98 ms (K20m), again a 10-run average. In both single and double precision, the 770 shows roughly a 1.18x speedup over the K20m.

Is there something I’m doing incorrectly here? How do I take full advantage of the K20m?

Thanks

This doesn’t look like compute-bound code. What’s the ratio of memory bandwidth between the two cards?

nvprof reports that the doubleMult kernel accounts for only 1% of the profiling time (63.3% goes to memcpy DtoH and 35.7% to memcpy HtoD). I figured that since I was just trying to get a measure of double-precision performance (and was only measuring the kernel time, not any memory operation time), that wouldn’t have mattered so much.

In the application I’m porting, I’ve used pinned memory. There the kernel takes 92.1% of the profiling time, with 6.6% for memcpy DtoH, 0.01% for memcpy HtoD, and 1.2% for another setup kernel.
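
(For anyone else reading: by pinned memory I mean page-locked host allocations along these lines; the buffer names, size, and stream are illustrative, not taken from the actual application.)

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 25;
    double *h_cells, *d_cells;                                 // hypothetical buffers
    cudaMallocHost((void**)&h_cells, n * sizeof(double));      // pinned (page-locked) host memory
    cudaMalloc((void**)&d_cells, n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned memory makes the copy faster and lets it run asynchronously,
    // so transfers can overlap with kernel work.
    cudaMemcpyAsync(d_cells, h_cells, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_cells);
    cudaFreeHost(h_cells);
    return 0;
}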

GTX 770 has a peak theoretical memory bandwidth of 224 GB/s:

http://www.nvidia.com/gtx-700-graphics-cards/gtx-770/

Tesla K20m has a peak theoretical memory bandwidth of 208 GB/s:

https://www.techpowerup.com/gpudb/2029/tesla-k20m

So for memory-bound work the GTX 770 should be faster, and the ratio may even be higher than 224/208 (≈ 1.08), because the K20m supports ECC; if ECC is enabled in your case, the K20m will have even less memory bandwidth available. Your ratio of 1.18 is quite plausible.

Your code, or more specifically your kernel, is memory bound, so the ratio of your kernel timings is approximately equal to the ratio of memory bandwidths.
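
A quick back-of-the-envelope check using your numbers (assuming three memory accesses per element: read x[i], read y[i], write y[i]):

#include <cstdio>

int main()
{
    // Effective bandwidth of the double-precision doubleMult kernel,
    // computed from the timings reported above.
    const double bytes = 3.0 * (1 << 25) * sizeof(double);    // ~0.8 GB moved per launch
    printf("GTX 770: %.0f GB/s\n", bytes / 4.5e-3 / 1e9);     // ~179 GB/s of a 224 GB/s peak
    printf("K20m   : %.0f GB/s\n", bytes / 5.4e-3 / 1e9);     // ~149 GB/s of a 208 GB/s peak
    return 0;
}

Both cards are already running at a large fraction of their theoretical peak bandwidth, which is the signature of a memory-bound kernel; there is no room for the K20m’s double-precision units to matter.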

Just to clarify: memory bandwidth on a GPU refers to the bandwidth from VRAM to the cores/SMs, not from system RAM, yes?

OK, that seems to make sense. So I’d need to do many more calculations on the memory I’m accessing before I’d get a compute-bound kernel rather than a memory-bound one.
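
For example (purely illustrative, not code from the application), a variant that loads each element once and then reuses it in many register-resident FMAs would raise the FLOP-per-byte ratio until the double-precision units, not the memory bus, become the limit:

// Illustrative only: same access pattern as doubleMult, but each loaded
// value feeds ITERS dependent FMAs in registers, so arithmetic dominates.
__global__ void doubleMultHeavy(int n, double *x, double *y)
{
    const int ITERS = 1024;   // arbitrary; raise until the kernel becomes compute bound
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        double a = x[i];
        double b = y[i];
        for (int k = 0; k < ITERS; ++k)
            b = a * b + a;    // compiles to an FMA, no extra memory traffic
        y[i] = b;
    }
}

Timing a variant like that on both cards should finally show the K20m’s double-precision advantage.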

Thanks!

Yes. You can make a “proxy” measurement of it with the CUDA bandwidthTest sample code (the reported device-to-device bandwidth number), then take the ratio between the two GPUs. If you have ECC enabled, this is probably a better estimate of the actual ratio than my use of the published theoretical maximums.
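
If you’d rather not build the sample, a rough in-code equivalent of that device-to-device number looks like this (the buffer size and single-copy timing are arbitrary sketch choices; a real measurement would warm up and average many copies):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u * 1024u * 1024u;    // 256 MB test buffer
    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each byte is read once and written once, hence the factor of two.
    printf("device-to-device bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));

    cudaFree(src);
    cudaFree(dst);
    return 0;
}

Run it once per GPU (with cudaSetDevice before the allocations on the multi-GPU machine) and take the ratio of the two numbers.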

Yes. The canonical “more calculations” code is a matrix-matrix multiply, such as a large CUBLAS gemm call. You should see the K20 pull ahead of the GTX 770 if you compare Sgemm between the two devices, and you’d see the K20 blow the GTX 770’s doors off if you compare Dgemm.
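
For instance, a sketch along these lines (matrix size, the lack of warm-up, and the GFLOP/s bookkeeping are arbitrary choices, not a tuned benchmark) will show the gap; swap cublasDgemm and double for cublasSgemm and float to compare single precision:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4096;                        // C = alpha*A*B + beta*C, all n x n
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);

    double *A, *B, *C;
    cudaMalloc((void**)&A, bytes);
    cudaMalloc((void**)&B, bytes);
    cudaMalloc((void**)&C, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // ~2*n^3 floating-point operations for a square GEMM.
    printf("DGEMM: %.1f GFLOP/s\n", 2.0 * n * n * (double)n / (ms * 1e6));

    cublasDestroy(handle);
    return 0;
}

Compiled with -lcublas and run on each card in turn, the K20m’s DGEMM number should come out several times higher than the GTX 770’s, whereas the memory-bound kernel above never shows that gap.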