Kepler vs GT200: pure computational performance issue

Hi all,
I'm starting this topic without console test output or examples; I just want to describe the problem and get your feedback, in case you are concerned about it or have faced it before.

What I'm trying to test is the computational performance of a GTX 690 (single die), which has ~1500 CUDA cores. I simply fill an array of doubles/floats of considerable size (Nt=4096 x Nx=8192) and then for each value perform an operation of the type
res += 1.0/(1.0 + res*res)
let's say C times (e.g. C=10). I see that this saturates the computational cores, as indicated by a significant bandwidth drop and good (proportional) time scaling with respect to C.
I used different launch geometries:
<<<(Nt/32,Nx/16,1),(32,16,1)>>>
or a cycled version with
<<<32,Nx>>> or <<<128,Nx>>>, doing the math for the Nt lines in Nt/32 or Nt/128 passes inside the kernel (see the sketch below). I use coalesced reading, of course.
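A rough sketch of what such a cycled variant can look like, written here as a grid-stride loop (the indexing in the actual kernel differs, and calc_Kernel_cycled is just an illustrative name):

__global__ void calc_Kernel_cycled(const float* A, float* B, int Nt, int Nx, int C)
{
    // A small, fixed grid walks the whole Nt*Nx array in grid-sized strides,
    // so each thread processes several elements instead of exactly one.
    const int stride = blockDim.x * gridDim.x;
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x; tid < Nt * Nx; tid += stride) {
        float V = A[tid];
        for (int i = 0; i < C; i++)
            V += 1.0f / (1.0f + V * V);   // the per-element operation from above
        B[tid] = V;
    }
}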
What I see is that the maximum performance of the GK104 (GTX 690) is only about 1.5~2x that of the G80 (GTX 285), while the number of cores is 6 times greater!
I also decided to limit the occupancy of the kernel simply by adding smem and filling it with zeros:
a)
__shared__ float smem[8192];
smem[threadIdx.x]=0.0;
and
b)
__shared__ float smem[8192/16];
smem[threadIdx.x]=0.0;

What I see is a performance drop of ~6x for a) and almost no drop for b) (since in b) the smem does not affect the occupancy in this case).
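(For reference, assuming the occupancy experiment was run on the GTX 690: 8192 floats is 32 KB of shared memory per block, which with at most 48 KB of shared memory per SMX limits each SMX to a single resident block, while 8192/16 = 512 floats is only 2 KB per block, so occupancy stays limited by other resources.)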

By the way: for C=0 the bandwidth is about ~140 GB/s for both the <<<128,Nx>>> variant and the <<<(Nt/32,Nx/16,1),(32,16,1)>>> variant. Not great, but the bandwidthTest from the CUDA samples gives 149 GB/s, so I take that as the reference point. In other words, the code itself reaches nearly full bandwidth and gets stuck at the math.
What do you think about this? I would appreciate any response on this problem for the G80, GF104, GF110, GK104, and GK110 architectures.

If anyone needs it, I will post the test code.

One thing to be careful about is your single/double precision values. You are doing a lot of double precision math there, which is notoriously slow on nearly all GeForce cards (especially the 600 series). While there may be 6 times as many cores, the structure of those cores differs significantly between the architectures (even Kepler was a big jump from Fermi), so it is not surprising that you don't see such a simple relationship.

The first thing I would do is decide whether you are testing double or single precision. If single, then change all your 0.0s to 0.0f, and do the same for the other constants.
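For illustration (a sketch, not your actual kernel), the difference between the two in a pair of hypothetical device functions:

// 1.0 is a double literal: V is promoted to double, the divide runs in
// double precision, and the result is narrowed back to float.
__device__ float step_dp(float V)
{
    return V + 1.0 / (1.0 + V * V);
}

// 1.0f keeps every operation in single precision.
__device__ float step_sp(float V)
{
    return V + 1.0f / (1.0f + V * V);
}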

Please note that the topic title isn't quite right:
the GTX 285 is based on a second-generation GT200 chip, not on G80. The 8800 GTX was a G80.

Given that you are implicitly doing double precision arithmetic, the scaling is quite different from what you are calculating. Two major things have changed since the GTX 285:

  1. The shader clock rate has dropped 30%, although the GTX 690 has the ability to boost its clock when the heat load allows. (I’ll assume the boosted clock below.)
  2. The ratio of single to double precision throughput in GeForce cards has changed dramatically. (Tesla hasn’t changed as much.)

GTX 285: 240 CUDA cores * 1476 MHz / 8 (double precision) = 44e9 DP instructions/sec

GTX 690 (1 die): 1536 CUDA cores * 1019 MHz (boost clock) / 24 (double precision) = 65e9 DP instructions/sec

Ratio: ~1.5x

So, while there has been a pretty large change in single precision throughput (~4x in just over 3 years), double precision performance has only grown by 50% in the GeForce line.
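For anyone who wants to plug in their own card, a minimal sketch that queries the device for the clock and SM count (the cores-per-SM and SP:DP ratio are not exposed by cudaGetDeviceProperties, so the Kepler GK104 values are hard-coded here as an assumption; adjust them for other architectures):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int coresPerSM = 192;   // assumed: Kepler GK104 SMX (GT200 has 8, Fermi 32)
    const int dpRatio    = 24;    // assumed: GeForce Kepler SP:DP instruction ratio

    double clockHz = prop.clockRate * 1000.0;                  // clockRate is reported in kHz
    double cores   = (double)prop.multiProcessorCount * coresPerSM;

    printf("%s: %d SMs at %.0f MHz\n", prop.name, prop.multiProcessorCount, clockHz * 1e-6);
    printf("peak SP: %.1fe9 instructions/sec\n", cores * clockHz * 1e-9);
    printf("peak DP: %.1fe9 instructions/sec\n", cores * clockHz / dpRatio * 1e-9);
    return 0;
}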

To Detlef Roettger:
Thanks, I've renamed the title.

To Tiomat:
Thanks! I had forgotten about one double constant! Now I see that my fp code is more than ~16x faster than the dp version (… 16, not 30 on the 285 or 64 on the 690).

2 Tiomat & seibert:
I’m still confused!
fp code took 60.34 ms on GTX285 and 27.00 ms on GTX690
I use arch sm_13 and sm_30 correspondingly and --use_fast_math to neglect all these IEEE for GPU.
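For reference, the build commands would look something like this (source and binary names are placeholders):

nvcc -arch=sm_13 --use_fast_math -o calc_285 calc.cu
nvcc -arch=sm_30 --use_fast_math -o calc_690 calc.cu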

the code:
__global__ void
calc_Kernel(float* A, float* B, int Nt, int Nx, int C)
{
    const int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    const int tidy = threadIdx.y + blockIdx.y * blockDim.y;
    const int tid  = tidx + tidy * Nt;
    register float un = 1.0f;
    if ((tidx < Nt) && (tidy < Nx)) {
        register float V = A[tid];
        register int i;
        // each "test #C" does C*16 iterations of the update per element
        for (i = 0; i < (C << 4); i++) V += un / (un + V * V);
        B[tid] = V;
    }
}

the call:
calc_Kernel<<<dim3(Nt/32, Nx/16, 1), dim3(32, 16, 1)>>>(dev_A, dev_B, Nt, Nx, 8);

OK, so in single precision the performance difference you are observing is 2.2x, rather than a theoretical maximum of 4x. Can you try different values of C again to make sure that memory bandwidth is not the new bottleneck?

seibert,
OK, here is what I see on the GTX 690 (BW is the effective bandwidth):
calc test #1 run time is: 4.94 ms, BW=50.61 GB/s
calc test #2 run time is: 8.40 ms, BW=29.77 GB/s
calc test #3 run time is: 11.98 ms, BW=20.87 GB/s
calc test #4 run time is: 15.59 ms, BW=16.04 GB/s
calc test #5 run time is: 19.20 ms, BW=13.02 GB/s
calc test #6 run time is: 22.82 ms, BW=10.96 GB/s
calc test #7 run time is: 24.63 ms, BW=10.15 GB/s
calc test #8 run time is: 27.00 ms, BW=9.26 GB/s
calc test #9 run time is: 30.25 ms, BW=8.27 GB/s
calc test #10 run time is: 33.50 ms, BW=7.46 GB/s
calc test #11 run time is: 36.75 ms, BW=6.80 GB/s
calc test #12 run time is: 39.59 ms, BW=6.31 GB/s

and here is the GTX 285:
calc test #1 run time is: 9.66 ms, BW=25.89 GB/s
calc test #2 run time is: 17.61 ms, BW=14.19 GB/s
calc test #3 run time is: 24.73 ms, BW=10.11 GB/s
calc test #4 run time is: 31.84 ms, BW=7.85 GB/s
calc test #5 run time is: 38.97 ms, BW=6.42 GB/s
calc test #6 run time is: 46.07 ms, BW=5.43 GB/s
calc test #7 run time is: 53.19 ms, BW=4.70 GB/s
calc test #8 run time is: 60.30 ms, BW=4.15 GB/s
calc test #9 run time is: 67.42 ms, BW=3.71 GB/s
calc test #10 run time is: 74.53 ms, BW=3.35 GB/s
calc test #11 run time is: 81.65 ms, BW=3.06 GB/s
calc test #12 run time is: 88.77 ms, BW=2.82 GB/s

The test number (#) is C, actually. You can see the perfect scaling in the results.
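(As a cross-check of the BW column, assuming it is total bytes read plus written divided by run time: 2 x 4096 x 8192 x 4 bytes = 256 MiB, and 256 MiB / 27.00 ms ≈ 9.26 GiB/s, which matches test #8 on the GTX 690.)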

P.S. I'm not posting the whole code because it contains CUDA 3.0 timer calls (via cutil), so it would be hard to compile with CUDA 5.0; a cutil-free timing sketch is below.
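For what it's worth, a minimal cutil-free timing sketch using CUDA events (it assumes dev_A, dev_B, Nt, Nx, C and the calc_Kernel from above are already set up):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
calc_Kernel<<<dim3(Nt/32, Nx/16, 1), dim3(32, 16, 1)>>>(dev_A, dev_B, Nt, Nx, C);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds
double bytes = 2.0 * Nt * Nx * sizeof(float);    // one read + one write per element
printf("run time: %.2f ms, BW=%.2f GB/s\n", ms, bytes / (ms * 1e-3) / (1 << 30));

cudaEventDestroy(start);
cudaEventDestroy(stop);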