This is my first test; please tell me if I've got something wrong.

My kernel is:

```
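// Adds 1 to every element of an nb x nb row-major matrix.
// Assumes nb is a multiple of BLOCK_DIM (there is no bounds check).
__global__ void addOne(float* t, int nb){
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    t[yIndex*nb+xIndex] += 1;
}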
```

Given a matrix t of size nb*nb, the kernel adds 1 to each element.

In the main function, I do the following:

```
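// One thread per element; assumes nb is a multiple of BLOCK_DIM.
dim3 grid(nb/BLOCK_DIM, nb/BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM,BLOCK_DIM,1);
// Time the kernel with CUDA events (start and stop are assumed to have been created earlier).
cudaEventRecord(start, 0);
addOne<<<grid,threads>>>(tab_dev,nb);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has been reached
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in milliseconds
printf("Duration : %f ms\n", elapsedTime);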
// bytes / (ms * 1e6) = GB/s; done in double so nb*nb cannot overflow int for larger nb
printf("Bandwidth : %f GBytes/s\n", (2.0*nb*nb*sizeof(float))/(elapsedTime*1.0e6));
printf("GFLOPS : %f\n", (7.0*nb*nb)/(elapsedTime*1.0e6));
```

Each thread reads one element from global memory and writes it back once => 2*nb*nb memory accesses in total.
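
The one-element-per-thread assumption can be sanity-checked by replaying the kernel's index arithmetic on the host. The following is only a sketch and not part of the program above: `coversMatrixOnce` and `blockSide` are hypothetical names, the nested loops stand in for `blockIdx` and `threadIdx`, and no GPU is involved, so it should build with nvcc or any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of addOne's indexing for a square grid of square blocks.
// blockSide plays the role of BLOCK_DIM (hypothetical parameter name).
// Returns true if every element of the nb x nb matrix is incremented exactly once.
bool coversMatrixOnce(unsigned nb, unsigned blockSide) {
    assert(blockSide > 0);
    unsigned blocks = nb / blockSide;   // integer division, as in dim3 grid(...)
    std::vector<float> t(static_cast<std::size_t>(nb) * nb, 0.0f);

    for (unsigned by = 0; by < blocks; ++by)
        for (unsigned bx = 0; bx < blocks; ++bx)
            for (unsigned ty = 0; ty < blockSide; ++ty)
                for (unsigned tx = 0; tx < blockSide; ++tx) {
                    unsigned xIndex = bx * blockSide + tx;
                    unsigned yIndex = by * blockSide + ty;
                    t[static_cast<std::size_t>(yIndex) * nb + xIndex] += 1;
                }

    // Each element should have been touched by exactly one emulated thread.
    for (float v : t)
        if (v != 1.0f) return false;
    return true;
}
```

Under these assumptions `coversMatrixOnce(64, 16)` returns `true`, while `coversMatrixOnce(60, 16)` returns `false`, because the integer division in the grid size leaves the last rows and columns without any thread.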

In the kernel, I count 7 operations (+ and *) per thread: two for each index, two for the array offset, and one for the `+= 1` => 7*nb*nb operations.

So, for nb=8192 and BLOCK_DIM=16, I get these results:

```
Duration : 10.438016 ms
Bandwidth : 51.434192 GBytes/s
GFLOPS : 45.004918
```
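
The two derived figures can be recomputed from the duration alone, which makes the unit conversion easy to check. This is a minimal host-only sketch, assuming 1 GB = 1e9 bytes and a duration in milliseconds; `bandwidthGBps` and `gigaOpsPerSecond` are hypothetical helper names, not CUDA API calls.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s, assuming each of the nb*nb floats
// is read once and written once.
double bandwidthGBps(std::size_t nb, double elapsedMs) {
    assert(elapsedMs > 0.0);
    double bytes = 2.0 * static_cast<double>(nb) * static_cast<double>(nb) * sizeof(float);
    return bytes / (elapsedMs * 1.0e6);   // ms * 1e6 is the same as s * 1e9
}

// Operation rate in Gop/s for a given number of operations per thread,
// with one thread per matrix element.
double gigaOpsPerSecond(std::size_t nb, double opsPerThread, double elapsedMs) {
    assert(elapsedMs > 0.0);
    double ops = opsPerThread * static_cast<double>(nb) * static_cast<double>(nb);
    return ops / (elapsedMs * 1.0e6);
}
```

With nb = 8192 and the 10.438016 ms duration above, `bandwidthGBps(8192, 10.438016)` gives about 51.43 and `gigaOpsPerSecond(8192, 7, 10.438016)` gives about 45.00, which matches the printed values.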

Is this correct, or have I made a mistake?