I’m starting this topic without console output or test code; I just want to describe the problem and get your feedback: does it concern you, or have you maybe run into it before?
What I’m trying to test is the computational performance of the GTX 690 (a single die), which has ~1500 CUDA cores. I simply fill an array of doubles/floats of considerable size (Nt = 4096 x Nx = 8192) and then, for each value, perform a simple arithmetic operation repeated C times (say C = 10). I see that this saturates the compute cores: it shows up as a significant bandwidth drop and good (proportional) time scaling with respect to C.
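To make it concrete, here is a minimal sketch of the kind of kernel I mean (the actual arithmetic expression in my code differs; the multiply-add here is just a placeholder):

```c
// Sketch: each thread loads one element (coalesced), applies a cheap
// arithmetic op C times, and writes it back. For C = 0 the kernel is
// purely bandwidth-bound; as C grows it becomes math-bound.
__global__ void math_kernel(float *a, int C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = a[i];                    // coalesced read
    for (int c = 0; c < C; ++c)
        v = v * 1.0001f + 0.0001f;     // placeholder op, repeated C times
    a[i] = v;                          // coalesced write
}
```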
I used different launch geometries, e.g. <<<(Nt/32, Nx/16, 1), (32, 16, 1)>>>, or a cycled version with <<<32, Nx>>> / <<<128, Nx>>>, doing the math for the Nt lines in Nt/32 or Nt/128 passes per block. (I use coalesced reads, of course.)
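The cycled variant looks roughly like this (a sketch, not my exact code; it assumes the block width divides Nx so that reads within a line stay coalesced):

```c
// Cycled variant: launch a fixed number of blocks (32 or 128) and let
// each block stride over the Nt lines, so the same geometry covers the
// whole Nt x Nx array in Nt/gridDim.x passes.
__global__ void math_kernel_cycled(float *a, int Nt, int Nx, int C)
{
    for (int t = blockIdx.x; t < Nt; t += gridDim.x) {
        for (int x = threadIdx.x; x < Nx; x += blockDim.x) {
            int i = t * Nx + x;            // coalesced within a line
            float v = a[i];
            for (int c = 0; c < C; ++c)
                v = v * 1.0001f + 0.0001f; // placeholder op
            a[i] = v;
        }
    }
}
```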
What I see is that the maximum performance of GK104 (GTX 690) is only about 1.5–2x that of GT200 (GTX 285), while the number of CUDA cores is six times greater!
I also decided to limit the kernel’s occupancy simply by declaring shared memory and filling it with zeros:
a) __shared__ float smem[8192];
b) __shared__ float smem[8192/16];
What I see is a ~6x performance drop for a) and almost no drop for b) (since in case b the shared memory does not limit occupancy).
By the way: for C = 0 the bandwidth is about ~140 GB/s for both the <<<128, Nx>>> and the <<<(Nt/32, Nx/16, 1), (32, 16, 1)>>> variants. Not great, but bandwidthTest from the CUDA samples gives 149 GB/s, so I take that as the reference point. My point is that the code itself unleashes nearly the full bandwidth and then gets stuck on the math.
What do you think about it? I would appreciate any response on this problem for the G80, GF104, GF110, GK104, and GK110 architectures.
If anyone needs it, I will post the test code.