I’m trying to reach the peak performance of a single SM with the code below. The peak should be somewhere around 25 GFlops (GTX 275, GT200 arch.), but this code gives 8 GFlops at most.
// Launch configuration: 1 block, 512 threads per block (GTX 275, GT200 arch.)
#define LOOP 10000000

__global__ void new_ker(float *x)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    float a = 0.0f;
    float b = x[index];

    // Dependent chain: each iteration needs the previous value of a, so there is no ILP
    #pragma unroll 2048
    for (int i = 0; i < LOOP; i++) {
        a = a * b + b;
    }
    x[index] = a;
}
I don’t want to increase the ILP in the code. Any ideas why it’s not reaching peak?
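For anyone checking the numbers, here is a rough sketch of the arithmetic behind the two figures above. It is plain host-side C++ (so it can sit in the same .cu file) and is not a benchmark. The helper names `peak_gflops_per_sm` and `achieved_gflops` are made up for illustration, and it assumes each loop iteration counts as one multiply-add, i.e. 2 flops per thread.

```cpp
#include <cassert>

// Theoretical single-precision peak of ONE SM, in GFLOP/s.
// All three inputs are assumptions supplied by the caller, not queried from the device.
double peak_gflops_per_sm(int sp_per_sm, int flops_per_sp_per_clock,
                          double shader_clock_ghz) {
    return sp_per_sm * flops_per_sp_per_clock * shader_clock_ghz;
}

// Achieved GFLOP/s for the kernel in the question.
// Assumption: a = a*b + b is counted as one multiply-add = 2 flops per iteration per thread.
double achieved_gflops(long long loop_iters, int threads, double seconds) {
    assert(seconds > 0.0);  // elapsed kernel time must be positive
    const double flops = 2.0 * static_cast<double>(loop_iters) * threads;
    return flops / seconds / 1e9;
}
```

Assuming 8 SPs per SM, one MAD per SP per clock and a 1.404 GHz shader clock (the GTX 275 reference clock as far as I know, so check your card), `peak_gflops_per_sm(8, 2, 1.404)` gives about 22.5 GFLOP/s. With LOOP = 10000000 and 512 threads the kernel does 1.024e10 flops, so 8 GFLOP/s corresponds to a kernel time of about 1.28 s. If the measurement counts only 1 flop per iteration, the reported figure halves, so it is worth stating which convention is used.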