No, it’s not possible to run dependent operations back to back. That is why you should load the GPU with about 24 threads per floating-point unit, so that the latency can always be hidden by instructions from independent threads (GPUs optimize for throughput, not latency).
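To make the latency-hiding argument concrete, here is a rough sketch of a toy model in plain C++ (not a measurement, and not how the hardware scheduler really works). It assumes a single unit that issues one instruction per cycle, a result latency of 24 cycles (picked only to match the ~24 threads figure), and round-robin selection among threads that each run a chain of dependent instructions. The names `cycles_to_issue_all` and `utilization` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (an assumption, not real hardware): one arithmetic unit issues at
// most one instruction per cycle, and a result becomes usable `latency` cycles
// after issue. Each thread runs a chain of dependent instructions, so it cannot
// issue again until its previous result is ready.
// Returns the number of cycles elapsed up to and including the last issue.
std::uint64_t cycles_to_issue_all(int threads, int latency, int instrs_per_thread) {
    assert(threads > 0 && latency > 0 && instrs_per_thread > 0);
    std::vector<std::uint64_t> ready_at(threads, 0);        // earliest cycle each thread may issue
    std::vector<int> remaining(threads, instrs_per_thread); // instructions left per thread
    long long left = 1LL * threads * instrs_per_thread;
    std::uint64_t cycle = 0;
    int next = 0;  // round-robin starting point
    while (left > 0) {
        // Issue from the first ready thread, if any; otherwise the unit idles.
        for (int k = 0; k < threads; ++k) {
            int t = (next + k) % threads;
            if (remaining[t] > 0 && ready_at[t] <= cycle) {
                ready_at[t] = cycle + latency;
                --remaining[t];
                --left;
                next = (t + 1) % threads;
                break;
            }
        }
        ++cycle;
    }
    return cycle;
}

// Fraction of cycles in which the unit actually issued an instruction.
double utilization(int threads, int latency, int instrs_per_thread) {
    double issued = static_cast<double>(threads) * instrs_per_thread;
    return issued / static_cast<double>(cycles_to_issue_all(threads, latency, instrs_per_thread));
}
```

In this model, `utilization(1, 24, 100)` comes out at roughly 4%, `utilization(12, 24, 100)` at about half, and `utilization(24, 24, 100)` at exactly 1.0: once there are as many independent threads as cycles of latency, the unit never has to wait on a dependent result.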
Anyway, the latency depends on the operands: it is longer when the same register is used for several different operands of one instruction. That’s probably some limitation in the register file read / operand fetch.
You can take a look here: it’s messy and the results there aren’t comprehensive, but you can ask the people who have done enough measurements. I haven’t, anyway.