According to the OpenCL Best Practices Guide: "Register dependencies arise when an instruction uses a result stored in a register written by an instruction before it. The latency on current CUDA-enabled GPUs is approximately 24 cycles, so threads must wait 24 cycles before using an arithmetic result."
I found a loop where I have the same problem, and tested it.
Result: it’s not always a good idea to cache the result of a simple operation in a register.
This code:
    while( (iter < maxIter) && ((zr*zr+zi*zi) < escapeOrbit) )
    {
      temp = zr * zi;
      zr = zr*zr - zi*zi + cr;
      zi = temp + temp + ci;
      //etc ....
    }
is faster than this one:
    while( (iter < maxIter) && ((zr2+zi2) < escapeOrbit) )
    {
      temp = zr * zi;
      zr2 = zr * zr;
      zi2 = zi * zi;
      zr = zr2 - zi2 + cr;
      zi = temp + temp + ci;
      //etc ....
    }
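For anyone who wants to reproduce the comparison, here is a minimal host-side C sketch of the two loops as plain functions. The function names, the `float` type, the starting point z = 0 and the `iter++` at the end of each body are my assumptions, since the snippets above elide those parts. In this sketch the cached squares are refreshed at the end of the body so the loop test always sees the current z; whether the original does the same depends on the elided part. It only checks that both variants count the same iterations, it says nothing about which one is faster on a GPU.

```c
#include <assert.h>

/* Variant 1: recompute zr*zr and zi*zi wherever they are needed. */
int escape_iters_recompute(float cr, float ci, int max_iter, float escape_orbit)
{
    assert(max_iter >= 0);
    float zr = 0.0f, zi = 0.0f, temp; /* assumed start: z = 0 */
    int iter = 0;
    while ((iter < max_iter) && ((zr * zr + zi * zi) < escape_orbit)) {
        temp = zr * zi;
        zr = zr * zr - zi * zi + cr;
        zi = temp + temp + ci;
        iter++; /* assumed: the elided part of the loop advances iter */
    }
    return iter;
}

/* Variant 2: keep the squares in zr2 and zi2 and reuse them in the test. */
int escape_iters_cached(float cr, float ci, int max_iter, float escape_orbit)
{
    assert(max_iter >= 0);
    float zr = 0.0f, zi = 0.0f, temp;
    float zr2 = 0.0f, zi2 = 0.0f; /* squares of the assumed start z = 0 */
    int iter = 0;
    while ((iter < max_iter) && ((zr2 + zi2) < escape_orbit)) {
        temp = zr * zi;
        zr = zr2 - zi2 + cr;
        zi = temp + temp + ci;
        /* assumed ordering: refresh the cached squares for the new z */
        zr2 = zr * zr;
        zi2 = zi * zi;
        iter++;
    }
    return iter;
}
```

Calling both with the same arguments, for example `escape_iters_recompute(1.0f, 0.0f, 100, 4.0f)` and `escape_iters_cached(1.0f, 0.0f, 100, 4.0f)`, should give the same count; the loop bodies can then be dropped into a kernel for timing.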
The best way of handling these register dependencies is simply to have at least 6 warps active on each multiprocessor (for compute capability 1.x devices). On those devices a warp instruction takes 4 cycles to issue, so assuming round-robin scheduling, 6 warps will completely hide a latency of 6*4=24 cycles.
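The arithmetic behind the 6-warp figure can be written out as a tiny helper. This is only a sketch: the function name and its parameters are hypothetical, and it assumes the simple round-robin model described above, where each active warp occupies the multiprocessor for a fixed number of cycles per instruction.

```c
#include <assert.h>

/* Smallest number of warps such that
   warps * cycles_per_warp_instr >= latency_cycles (integer ceiling division). */
int min_warps_to_hide_latency(int latency_cycles, int cycles_per_warp_instr)
{
    assert(latency_cycles >= 0 && cycles_per_warp_instr > 0);
    return (latency_cycles + cycles_per_warp_instr - 1) / cycles_per_warp_instr;
}
```

With the figures from this thread, `min_warps_to_hide_latency(24, 4)` gives 6 warps, which is 192 threads per multiprocessor at 32 threads per warp.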