What optimization does the compiler apply when I manually unroll a texture fetch?
I have a kernel that uses 1D texture fetches followed by some calculations. If I write
for(int i = 0; i < 3; i++)
a[i] = tex1Dfetch(tex, b + i);
… // calculations
This version uses 16 registers per thread, and the occupancy is 67%. If I instead write
a[0] = tex1Dfetch(tex, b + 0);
a[1] = tex1Dfetch(tex, b + 1);
a[2] = tex1Dfetch(tex, b + 2);
… // calculations
This version uses 23 registers per thread, and the occupancy is only 33%.
Yet the second kernel actually runs faster than the first. What optimization is the compiler applying in this case? It costs many more registers, but the kernel is faster.
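To make the comparison reproducible, here is a minimal sketch of the two variants as separate kernels, so that ptxas can report the register count of each. Several things in it are assumptions rather than part of my real kernel: the kernel names, the `float` element type, the `out` and `n` parameters, the `b = 3 * tid` indexing, and the final sum (a placeholder for the elided calculations). It also uses the texture object API instead of a texture reference named `tex`. The `#pragma unroll` in the first kernel is only a request to the compiler; whether the generated code then matches the hand-unrolled version has to be checked in the ptxas output.

```cuda
#include <cuda_runtime.h>

// Variant 1: loop form. The trip count is a compile-time constant, and
// #pragma unroll asks the compiler to unroll it (a request, not a guarantee).
__global__ void fetch_loop(cudaTextureObject_t tex, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int b = 3 * tid;                 // hypothetical base index
    if (b + 2 >= n) return;          // stay inside the bound texture range

    float a[3];
    #pragma unroll
    for (int i = 0; i < 3; i++)
        a[i] = tex1Dfetch<float>(tex, b + i);

    // Placeholder for the real calculations: keeps the fetches live.
    out[tid] = a[0] + a[1] + a[2];
}

// Variant 2: the same three fetches written out by hand.
__global__ void fetch_unrolled(cudaTextureObject_t tex, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int b = 3 * tid;                 // hypothetical base index
    if (b + 2 >= n) return;

    float a[3];
    a[0] = tex1Dfetch<float>(tex, b + 0);
    a[1] = tex1Dfetch<float>(tex, b + 1);
    a[2] = tex1Dfetch<float>(tex, b + 2);

    // Same placeholder calculation as in variant 1.
    out[tid] = a[0] + a[1] + a[2];
}

// Compile without linking and print per-kernel resource usage:
//   nvcc -c -Xptxas -v fetch.cu
```

If the two kernels report different register counts, the compiler is scheduling the fetches differently in each. Rebuilding with `-maxrregcount` set to the lower count would show whether the speed difference follows the register usage or the occupancy.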