Compiler optimization for texture fetch? Unrolled texture fetch.

What optimization does the compiler apply when I unroll the texture fetches?

I have a kernel that does 1D texture fetches followed by some calculations. If I write

for(int i = 0; i < 3; i++)
a[i] = tex1Dfetch(tex, b + i);
… // calculations

This version uses 16 registers per thread and the occupancy is 67%. If I instead write

a[0] = tex1Dfetch(tex, b + 0);
a[1] = tex1Dfetch(tex, b + 1);
a[2] = tex1Dfetch(tex, b + 2);
… // calculations

This version uses 23 registers per thread and the occupancy is only 33%.

Yet the kernel in the 2nd case is actually faster than the 1st. What optimization is the compiler doing here? It costs a lot of registers, but the result is faster.
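To compare the two variants side by side, a minimal sketch along these lines may help. It is not the original kernel: the texture object API (`cudaTextureObject_t`) stands in for the `tex` reference used above, the sum replaces the elided calculations, and the names `fetchRolled`, `fetchUnrolled` and `variantsAgree` are hypothetical. `#pragma unroll 1` asks nvcc to keep the loop rolled, while a bare `#pragma unroll` asks for full unrolling.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Rolled variant: "#pragma unroll 1" requests that the loop stay a loop.
__global__ void fetchRolled(cudaTextureObject_t tex, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int b = 3 * tid;
    float a[3];
    #pragma unroll 1
    for (int i = 0; i < 3; i++)
        a[i] = tex1Dfetch<float>(tex, b + i);
    out[tid] = a[0] + a[1] + a[2];   // placeholder for the real calculations
}

// Unrolled variant: "#pragma unroll" requests three straight-line fetches.
__global__ void fetchUnrolled(cudaTextureObject_t tex, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int b = 3 * tid;
    float a[3];
    #pragma unroll
    for (int i = 0; i < 3; i++)
        a[i] = tex1Dfetch<float>(tex, b + i);
    out[tid] = a[0] + a[1] + a[2];   // placeholder for the real calculations
}

// Host helper (hypothetical): runs both kernels on the same linear texture
// and reports whether they produce identical, correct results.
bool variantsAgree()
{
    const int n = 256;
    static float h_in[3 * n], h_r[n], h_u[n];
    for (int i = 0; i < 3 * n; i++) h_in[i] = (float)i;

    float *d_in, *d_r, *d_u;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_r, sizeof(h_r));
    cudaMalloc(&d_u, sizeof(h_u));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // Bind the linear buffer to a texture object for tex1Dfetch.
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = sizeof(h_in);
    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);

    fetchRolled<<<1, n>>>(tex, d_r, n);
    fetchUnrolled<<<1, n>>>(tex, d_u, n);
    cudaMemcpy(h_r, d_r, sizeof(h_r), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_u, d_u, sizeof(h_u), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; i++)      // each thread sums 3i, 3i+1, 3i+2
        if (h_r[i] != 9.0f * i + 3.0f || h_u[i] != h_r[i]) ok = false;

    cudaDestroyTextureObject(tex);
    cudaFree(d_in); cudaFree(d_r); cudaFree(d_u);
    return ok;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the register count ptxas assigns to each kernel, so the two variants can be compared directly. Whether the counts come out as 16 and 23 depends on the compiler version, the target architecture, and the real calculations that follow the fetches.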

The first version is effectively:

int i = 0;
loop:
a[i] = tex1Dfetch( tex, b + i );
i = i + 1;
if( i < 3 ) goto loop;

So per iteration you get:

  • 1 more add
  • 1 additional if
  • 1 additional jump

But compared with the cost of a texture fetch, the loop overhead is small. And the occupancy of the 1st version is much higher than that of the 2nd, which means it should hide the fetch latency better. In fact, if we wrote the assembly by hand, the 1st version would need just 1 more register than the 2nd (the “i” in the loop). So I guess the compiler applies some optimization that costs a lot of registers.
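The 67% and 33% occupancy figures can be reproduced with a rough register-only estimate. The thread does not state the hardware or the block size, so the values here are assumptions: 256 threads per block, and a multiprocessor with 8192 registers and at most 768 resident threads (the limits of compute capability 1.0/1.1 devices). The helper `occupancyPercent` is hypothetical, and it ignores shared memory, the per-multiprocessor block limit, and register allocation granularity.

```cuda
#include <algorithm>

// Simplified occupancy estimate: only registers and the resident-thread
// limit are considered. All hardware limits are passed in as assumptions.
int occupancyPercent(int regsPerThread, int threadsPerBlock,
                     int regsPerSM, int maxThreadsPerSM)
{
    // How many whole blocks fit, limited by registers and by thread count.
    int blocksByRegs    = regsPerSM / (regsPerThread * threadsPerBlock);
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;
    int blocks          = std::min(blocksByRegs, blocksByThreads);

    // Resident threads as a percentage of the maximum, rounded to nearest.
    return (100 * blocks * threadsPerBlock + maxThreadsPerSM / 2)
           / maxThreadsPerSM;
}
```

Under those assumptions, 16 registers per thread allows two resident blocks (512 of 768 threads) and 23 registers allows only one (256 of 768), which matches the reported figures. Going from 16 to 23 registers crosses the point where a second block no longer fits.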