Register Latency

According to the OpenCL Best Practices Guide: register dependencies arise when an instruction uses a result stored in a register written by a preceding instruction. The latency on current CUDA-enabled GPUs is approximately 24 cycles, so threads must wait about 24 cycles before using an arithmetic result.
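As a rough illustration (this kernel is mine, not from the guide), each line below reads the register written by the previous one, so the warp stalls on that latency unless the scheduler can switch to another warp in the meantime:

    // Hedged sketch of a register-dependency chain: y waits on x, z waits on y.
    __global__ void dependentChain(float *out, float a, float b)
    {
        float x = a * b;    // result written to a register
        float y = x * x;    // reads x -> pays the ~24-cycle latency
        float z = y + a;    // reads y -> pays it again
        out[blockIdx.x * blockDim.x + threadIdx.x] = z;
    }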

So I'm unsure about this code:

float q = (cr-0.25)*(cr-0.25) + ci2;

if( q*(q+(cr-0.25)) < 0.25*ci2) ...

versus :

float cr25 = cr-0.25;

float q = (cr25)*(cr25) + ci2;

if( q*(q+(cr25)) < 0.25*ci2) ...

(and maybe substituting “q” inline as well?)

if( ((cr-0.25)*(cr-0.25) + ci2)*(((cr-0.25)*(cr-0.25) + ci2)+(cr-0.25)) < 0.25*ci2)

Simply put: is it better to precompute a simple operation once, or to redo that simple operation several times to avoid register latency?
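In case it helps to time the two approaches, here is a minimal CUDA sketch of both variants as device functions (the function names are mine; cr, ci2, q and cr25 come from the snippets above):

    // Variant 1: precompute cr - 0.25f once and keep it in a register.
    __device__ bool insideCardioidPrecomputed(float cr, float ci2)
    {
        float cr25 = cr - 0.25f;
        float q = cr25 * cr25 + ci2;
        return q * (q + cr25) < 0.25f * ci2;
    }

    // Variant 2: redo the cr - 0.25f subtraction at every use.
    __device__ bool insideCardioidRecomputed(float cr, float ci2)
    {
        float q = (cr - 0.25f) * (cr - 0.25f) + ci2;
        return q * (q + (cr - 0.25f)) < 0.25f * ci2;
    }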

Thank you.

I found a loop where I have the same problem, and tested it.

Result: it's not always a good idea to cache the result of a simple operation in a register.

This code:

    while( (iter < maxIter) && ((zr*zr + zi*zi) < escapeOrbit) )
    {
        temp = zr * zi;
        zr = zr*zr - zi*zi + cr;
        zi = temp + temp + ci;
        //etc ....
    }

is faster than:

    while( (iter < maxIter) && ((zr2 + zi2) < escapeOrbit) )
    {
        temp = zr * zi;
        zr2 = zr * zr;
        zi2 = zi * zi;
        zr = zr2 - zi2 + cr;
        zi = temp + temp + ci;
        //etc ....
    }

The best way to handle these latencies is simply to have at least 6 warps active on each multiprocessor (for compute capability 1.x devices, where each warp takes 4 cycles to issue an instruction). Assuming round-robin scheduling, this completely hides latencies of 6*4 = 24 cycles.
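Putting the two halves together, here is a minimal self-contained CUDA sketch that wraps the faster loop from above in a kernel and launches it with 192 threads (6 warps) per block; the kernel name, the pixel-to-plane mapping, and the maxIter/escapeOrbit values are my own assumptions, not from the original post:

    #include <cuda_runtime.h>

    // Sketch of a Mandelbrot kernel using the faster of the two loops above.
    __global__ void mandelbrotKernel(int *out, int width, int height,
                                     int maxIter, float escapeOrbit)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= width * height) return;

        // Map the pixel to a point in the complex plane (assumed view window).
        float cr = -2.5f + 3.5f * (idx % width) / (float)width;
        float ci = -1.0f + 2.0f * (idx / width) / (float)height;

        float zr = 0.0f, zi = 0.0f, temp = 0.0f;
        int iter = 0;
        while( (iter < maxIter) && ((zr*zr + zi*zi) < escapeOrbit) )
        {
            temp = zr * zi;
            zr = zr*zr - zi*zi + cr;
            zi = temp + temp + ci;
            ++iter;
        }
        out[idx] = iter;
    }

    int main(void)
    {
        const int width = 1024, height = 1024, n = width * height;

        // 6 warps * 32 threads = 192 threads per block: enough warps per
        // multiprocessor to cover the ~24-cycle register latency, assuming
        // round-robin scheduling and at least one resident block per SM.
        const int threadsPerBlock = 192;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));
        mandelbrotKernel<<<blocks, threadsPerBlock>>>(d_out, width, height, 256, 4.0f);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }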