I now have an implementation of motion estimation on G80. Measured performance is 90 GFLOPS, about 1/4 of the peak.

There are two reasons for this gap:

- About half of the instructions are not doing real arithmetic; they go to overhead such as index computation and memory loads/stores.
- The average is about 8 cycles per warp instruction (from the Visual Profiler). Ideally, one warp instruction takes only 4 cycles.
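
As a sanity check, the two factors above roughly account for the gap if I assume they simply multiply. That is my own simplification, not something the profiler reports. A minimal sketch of that arithmetic (the function name and the multiplicative model are hypothetical; the inputs 0.5, 4 and 8 are the figures quoted above):

```python
# Back-of-the-envelope model of the two losses listed above.
# Assumption: the two losses are independent and multiply.

def fraction_of_peak(arith_fraction, ideal_cycles, measured_cycles):
    """Estimate achieved/peak throughput when only `arith_fraction` of the
    issued instructions do useful arithmetic and each warp instruction takes
    `measured_cycles` cycles instead of `ideal_cycles`."""
    issue_efficiency = ideal_cycles / measured_cycles
    return arith_fraction * issue_efficiency

# Figures from the post: half the instructions are arithmetic, 8 cycles vs. 4.
estimate = fraction_of_peak(arith_fraction=0.5, ideal_cycles=4, measured_cycles=8)

# Same instruction mix, but with the ideal 4 cycles per warp instruction.
if_cycles_fixed = fraction_of_peak(arith_fraction=0.5, ideal_cycles=4, measured_cycles=4)

print(f"estimated fraction of peak now: {estimate:.2f}")
print(f"fraction of peak if reason 2 were fixed: {if_cycles_fixed:.2f}")
```

Under this model the estimate comes out at 0.25, which matches the measured "about 1/4 of the peak", and fixing reason 2 alone would at best double the throughput.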

My question is: what causes the inefficient cycles per warp instruction (reason 2)? Is there any tool for analyzing what exactly happens at each cycle? And how can I bridge the gap?

Thanks!