Hi everybody,

I’m running some performance tests on my Nvidia Quadro FX 1700: I want to compute the GFlops/s my program achieves during execution.

My CUDA code:

```
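#define BLOCKSIZE 16  // tile width; N must be a multiple of BLOCKSIZE (no bounds checks)

// Tiled matrix multiply C = A * B for square N x N row-major matrices.
// Expects BLOCKSIZE x BLOCKSIZE threads per block and an
// (N / BLOCKSIZE) x (N / BLOCKSIZE) grid.
__global__
void matrixMulKernel(float* A, float* B, float* C, int N) {
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Range of A tiles (one block-row) and B tiles (one block-column)
    // that this block walks through.
    int aBegin = N * BLOCKSIZE * by;
    int aEnd   = aBegin + N - 1;
    int aStep  = BLOCKSIZE;
    int bBegin = BLOCKSIZE * bx;
    int bStep  = BLOCKSIZE * N;

    // Partial result for the single C element owned by this thread.
    float Csub = 0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCKSIZE][BLOCKSIZE];
        __shared__ float Bs[BLOCKSIZE][BLOCKSIZE];

        // Each thread loads one element of each tile into shared memory.
        As[ty][tx] = A[a + N * ty + tx];
        Bs[ty][tx] = B[b + N * ty + tx];
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < BLOCKSIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();  // everyone done before the tiles are overwritten
    }

    // Write the result to global memory.
    int c = N * BLOCKSIZE * by + BLOCKSIZE * bx;
    C[c + N * ty + tx] = Csub;
}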
```

As you can see, it multiplies square matrices, and BLOCKSIZE is equal to 16. I assume the number of floating-point operations is NB = (2N - 1) * N². I measure the execution time T with cudaEventRecord. To get the GFlops/s figure, I just divide NB by T, but I think I'm getting wrong numbers:

N = 128 => 4.52 GFlops/s (5.93 GFlops/s with Cublas)

N = 384 => 10.54 GFlops/s (21.14 GFlops/s with Cublas)

N = 640 => 12.28 GFlops/s (26.56 GFlops/s with Cublas)

N = 896 => 13.01 GFlops/s (28.21 GFlops/s with Cublas)

N = 1152 => 13.42 GFlops/s (30.37 GFlops/s with Cublas)

N = 1408 => 13.67 GFlops/s (31.09 GFlops/s with Cublas)

N = 1664 => 13.83 GFlops/s (31.59 GFlops/s with Cublas)

N = 1920 => 13.90 GFlops/s (31.93 GFlops/s with Cublas)

N = 2176 => 13.91 GFlops/s (32.18 GFlops/s with Cublas)

N = 2432 => 14.04 GFlops/s (31.83 GFlops/s with Cublas)
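For reference, here is a minimal sketch of the arithmetic behind the figures above. It is plain host-side C++ and is not my exact benchmark code; the function names `matmul_flops` and `gflops` are made up for this illustration. It assumes the elapsed time comes from `cudaEventElapsedTime`, which reports milliseconds, so the time has to be converted to seconds before dividing.

```cpp
#include <cassert>

// Floating-point operations in an N x N matrix product, using the count
// from this post: each of the N*N outputs needs N multiplies and N-1 adds.
// (Hypothetical helper name, for illustration only.)
double matmul_flops(int n) {
    assert(n > 0);
    const double nd = static_cast<double>(n);
    return (2.0 * nd - 1.0) * nd * nd;
}

// Convert an elapsed time in milliseconds (the unit reported by
// cudaEventElapsedTime) into GFlops/s for an N x N product.
// (Hypothetical helper name, for illustration only.)
double gflops(int n, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;    // ms -> s
    return matmul_flops(n) / seconds / 1e9;      // flop/s -> GFlops/s
}
```

For example, `gflops(128, 1.0)` evaluates to about 4.18, since (2 * 128 - 1) * 128² = 4,177,920 operations in one millisecond is roughly 4.18e9 operations per second. Mixing up milliseconds and seconds would shift every result by a factor of 1000, so that is the first thing worth checking.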

Did I do something wrong?

PS: my device information:

```
Device 0: "Quadro FX 1700"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
Test PASSED
```
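As a rough sanity check on the numbers, the device information above gives an upper bound to compare against. The sketch below assumes each core can retire one multiply-add (2 floating-point operations) per clock; that factor is my assumption and is not something the device query reports. The function name `peak_gflops` is made up for this illustration.

```cpp
// Rough theoretical peak in GFlops/s:
//   cores * clock (GHz) * floating-point operations per core per clock.
// The last factor is an assumption supplied by the caller, not a value
// reported by the device query. (Hypothetical helper name.)
double peak_gflops(int cores, double clock_ghz, double flops_per_core_per_clock) {
    return cores * clock_ghz * flops_per_core_per_clock;
}
```

With the values listed above, `peak_gflops(32, 0.92, 2.0)` gives 58.88, so under that assumption both my kernel (about 14 GFlops/s) and Cublas (about 32 GFlops/s) stay below the bound.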