Hi all:

I am trying to profile how fast my GPU can go.

I copied a simple SAXPY program, but I cannot make sense of the results.

```
#include <stdlib.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    int N = 20 * (1 << 20);
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* copy the inputs to the device before launching */
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    /* measure time start */
    saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y);
    /* measure time stop */
    ...
}
```

For example, it sadly reaches only about 0.1% of the GPU's peak FLOPS…

So I tried unrolling the loop, i.e., doing more work within each thread:

```
__global__
void saxpy(int n, float a, float *x, float *y, float *z)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int offset = 128 * i;
    if (offset + 128 <= n) {          /* this thread's whole chunk is in range */
        for (int k = 0; k < 128; k++) {
            z[offset + k] = a * x[offset + k] + y[offset + k];
        }
    }
}
```

In this example each thread does 128 calculations instead of just one, as in the original.

However, it makes things worse… Note that the number of blocks is reduced accordingly, so the total number of operations stays the same.

Finally, I tried using a temporary variable (or a temporary array) to store the result.

Now it becomes much faster and gets close to the theoretical max FLOPS.

Even so, I think this approach must be wrong, since the result never leaves the kernel and cannot be copied back:

```
tmp = a * x[offset + k] + y[offset + k];
```

My questions are:

- Can I measure it this way? Why does the temp variable (or array) make it so much faster?
- Why does unrolling the loop make it slower?
- I also used cuBLAS's cublasSaxpy, and the result is likewise close to 0.1%; does that imply something?

Also, I found that the CUDA sample 0_Simple/matrixMul includes a profiling routine as well.

It reaches 1330.88 GFLOP/s, compared to ~60 GFLOP/s for this even simpler test program. That confuses me…

Thanks in advance!