Instruction Latency

Well, you already answered your own question. ;)

Also try:

a=b*a+a

a=a*a+a

a=a+a

a=a*a
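
A minimal sketch of how one could time these four chains, assuming a device that supports clock64() (compute capability 2.0 or later; on older hardware clock() would be the substitute). The names chain_latency, run and CHAIN are made up for illustration. The figure it prints includes the overhead of reading the clock, and the compiler is free to fuse, fold, or reorder the arithmetic, so it is worth checking the generated SASS (cuobjdump -sass) before trusting the numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int CHAIN = 256;  // dependent operations per timed region (arbitrary choice)

// OP selects the chained expression:
//   0: a = b*a + a    1: a = a*a + a    2: a = a + a    3: a = a*a
// Every iteration depends on the previous result, so a single thread
// exposes the latency of the operation rather than its throughput.
template <int OP>
__global__ void chain_latency(float a, float b, float *sink, long long *cycles)
{
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < CHAIN; ++i) {
        if (OP == 0)      a = b * a + a;
        else if (OP == 1) a = a * a + a;
        else if (OP == 2) a = a + a;
        else              a = a * a;
    }
    long long stop = clock64();

    *sink = a;               // store the result so the chain is not eliminated
    *cycles = stop - start;  // elapsed shader clocks, including timer overhead
}

template <int OP>
void run(const char *label, float *d_sink, long long *d_cycles)
{
    // One block of one thread: nothing else is available to hide the latency.
    chain_latency<OP><<<1, 1>>>(1.0f, 1.0f, d_sink, d_cycles);

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("%-10s ~%.1f cycles per chained expression\n",
           label, (double)cycles / CHAIN);
}

int main()
{
    float *d_sink = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_sink, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    run<0>("a=b*a+a", d_sink, d_cycles);
    run<1>("a=a*a+a", d_sink, d_cycles);
    run<2>("a=a+a",   d_sink, d_cycles);
    run<3>("a=a*a",   d_sink, d_cycles);

    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

The inputs are passed as kernel arguments so that they are not compile-time constants. With the default nvcc settings the first two variants will most likely be contracted into a single multiply-add per iteration; compiling with --fmad=false should keep the multiply and the add separate if that is what you want to compare.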

For the constant cache it would make sense, since the 32-wide load is split into two halves, and the constant cache runs at half the shader clock (according to the documentation).

Shared memory requires arbitration logic and a full crossbar between the execution units and the SRAM banks, so it would not be surprising if it required one extra (slow) clock cycle.