Hi,

I’ve been porting a CPU application to CUDA (it's very similar to Conway’s Game of Life, but with more complex neighbour checks involving double-precision calculations). I've been testing on a GTX 770 (the sole graphics card in a Win7 machine, so WDDM is enabled; driver version 378.49) and a Tesla K20m (one of four cards in an Ubuntu 16.04.2 machine; driver version 381.09).

The 770 has outperformed the K20m in all of my tests, even in double precision, where the K20m should be roughly 8x faster according to Wikipedia and to the nbody tool in the extras folder (run with -benchmark -fp64). The 770 is newer by a couple of months, but both are Kepler devices, and on paper they seem very close in performance (what the 770 lacks in cores, it makes up for in clock speed), except that double precision is supposed to be much faster on the K20m. I also compiled for compute_35 on the K20m and compute_30 on the 770 (the maximum for each, from what I can tell).

I wrote a simple kernel to test this:

```
// Grid-stride loop: each thread handles elements index, index+stride, ...
__global__ void doubleMult(int n, double *x, double *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] * y[i];
}
```
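
For reference, the host side looks roughly like this (a simplified sketch, not my exact test code; the launch configuration and cudaEvent timing here are illustrative):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleMult(int n, double *x, double *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] * y[i];
}

int main()
{
    const int n = 1 << 25;  // 2^25 elements
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));

    // Illustrative launch configuration: enough blocks to cover n once
    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    doubleMult<<<grid, block>>>(n, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```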

I run this on a vector of 2^25 elements; the kernel takes about 4.5 ms on the 770 vs 5.4 ms on the K20m according to nvprof (average of 10 runs). Even changing the operation to an FMA with

```
y[i] = x[i] * y[i] + x[i];
```

exhibits no change in performance.

A single-precision version of the above kernel runs in 2.51 ms (770) vs 2.98 ms (K20m), again averaged over 10 runs. In both single and double precision, the 770 comes out about 1.18x faster than the K20m.

Is there something I’m doing incorrectly here? How do I take full advantage of the K20m?

Thanks