How to squeeze more performance out of G80?

I now have an implementation of motion estimation on G80. Measured performance is 90 GFLOPS, about 1/4 of the peak.
There are two reasons for this gap:

  1. About half of the instructions do no real arithmetic; they go to overhead such as index computation and memory loads/stores.
  2. The average cost of a warp instruction is about 8 cycles (from the visual profiler). Ideally, a warp instruction takes only 4 cycles.
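
As a quick sanity check, the two factors above would account for the whole gap if they are assumed to be independent and to multiply (that is an assumption of this sketch, not something measured). A minimal Python calculation using only the figures quoted in this post; the variable names are made up for illustration:

```python
# Figures quoted in the post.
measured_gflops = 90.0                 # measured throughput
arithmetic_fraction = 0.5              # share of instructions doing real arithmetic
ideal_cycles_per_warp_instr = 4.0      # best case for one warp instruction
observed_cycles_per_warp_instr = 8.0   # average reported by the visual profiler

# Assumption: the two effects are independent and simply multiply.
issue_efficiency = ideal_cycles_per_warp_instr / observed_cycles_per_warp_instr
expected_fraction_of_peak = arithmetic_fraction * issue_efficiency

# Peak throughput implied by the measurement, if this simple model holds.
implied_peak_gflops = measured_gflops / expected_fraction_of_peak

print(f"issue efficiency:          {issue_efficiency:.2f}")
print(f"expected fraction of peak: {expected_fraction_of_peak:.2f}")
print(f"implied peak:              {implied_peak_gflops:.0f} GFLOPS")
```

Under this model the expected fraction of peak is 0.5 x 0.5 = 0.25, which matches the "about 1/4" observed, so closing either factor alone would at best double the throughput.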

My question is: what causes the inefficient cycles per warp instruction (reason 2)? Is there any tool for analyzing what exactly happens at each cycle? And how can I bridge the gap?

Thanks!

How are you computing 2? The profiler only samples one SM, so you should use it only for relative, order-of-magnitude estimates, not for accurate computations.

Mark