I’m running some performance tests on my NVIDIA Quadro FX 1700: I want to measure the number of GFlops/s my program achieves.

My CUDA code:

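``````
#define BLOCKSIZE 16  // tile width; the post states BLOCKSIZE is 16

// Tiled multiplication C = A * B for square N x N matrices.
// Assumes N is a multiple of BLOCKSIZE and BLOCKSIZE x BLOCKSIZE thread blocks.
__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
    // Block and thread indices (tx and ty were missing in the posted code).
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // First index, last index, and stride of the tiles of A for this block.
    int aBegin = N * BLOCKSIZE * by;
    int aEnd   = aBegin + N - 1;
    int aStep  = BLOCKSIZE;

    // First index and stride of the tiles of B for this block.
    int bBegin = BLOCKSIZE * bx;
    int bStep  = BLOCKSIZE * N;

    float Csub = 0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCKSIZE][BLOCKSIZE];
        __shared__ float Bs[BLOCKSIZE][BLOCKSIZE];

        // Each thread loads one element of each tile.
        As[ty][tx] = A[a + N * ty + tx];
        Bs[ty][tx] = B[b + N * ty + tx];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < BLOCKSIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();  // wait before the next iteration overwrites the tiles
    }

    // Each thread writes one element of the result.
    int c = N * BLOCKSIZE * by + BLOCKSIZE * bx;
    C[c + N * ty + tx] = Csub;
}
``````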

As you can see, this is a multiplication of square matrices, and BLOCKSIZE is equal to 16. I assume the number of floating-point operations is NB = (2N - 1) * N². I measure the execution time T with cudaEventRecord. To get the GFlops number, I just divide NB by T. But I think I get wrong numbers:
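The arithmetic behind that division can be sketched as follows. This is a minimal host-side sketch in plain C++ (it should also compile unchanged inside a `.cu` file), assuming T comes from `cudaEventElapsedTime`, which reports milliseconds rather than seconds. The names `matmul_flops` and `gflops_from_ms` are hypothetical helpers, not part of the posted program.

```cpp
#include <cassert>

// Floating-point operations for one N x N by N x N product:
// each of the N*N outputs needs N multiplies and N - 1 adds.
double matmul_flops(int n) {
    double dn = static_cast<double>(n);  // double avoids 32-bit overflow for large N
    return (2.0 * dn - 1.0) * dn * dn;
}

// Convert an elapsed time in milliseconds into GFlops/s.
// Assumption: elapsed_ms is the value reported by cudaEventElapsedTime.
double gflops_from_ms(int n, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    double seconds = elapsed_ms * 1e-3;      // milliseconds -> seconds
    return matmul_flops(n) / seconds / 1e9;  // flops/s -> GFlops/s
}
```

For N = 2432 the operation count is already about 2.9e10, which does not fit in a 32-bit `int`, so computing NB in integer arithmetic would silently give wrong results. That is one possible source of error worth ruling out.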

N = 128 => 4.52 GFlops/s (5.93 GFlops/s with Cublas)

N = 384 => 10.54 GFlops/s (21.14 GFlops/s with Cublas)

N = 640 => 12.28 GFlops/s (26.56 GFlops/s with Cublas)

N = 896 => 13.01 GFlops/s (28.21 GFlops/s with Cublas)

N = 1152 => 13.42 GFlops/s (30.37 GFlops/s with Cublas)

N = 1408 => 13.67 GFlops/s (31.09 GFlops/s with Cublas)

N = 1664 => 13.83 GFlops/s (31.59 GFlops/s with Cublas)

N = 1920 => 13.90 GFlops/s (31.93 GFlops/s with Cublas)

N = 2176 => 13.91 GFlops/s (32.18 GFlops/s with Cublas)

N = 2432 => 14.04 GFlops/s (31.83 GFlops/s with Cublas)

Did I do something wrong??

PS: my device information:

``````
Device 0: "Quadro FX 1700"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 536150016 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes

Test PASSED
``````

Peak single-precision performance without dual issue is 32 (cores) x 0.92 (GHz) x 2 (a multiply-add counts as two flops) = 58.9 GFlops/s, and CUBLAS reaches 31.83 / 58.9 = 54% of that, so your numbers are correct.
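The same estimate can be written as a small host-side C++ sketch. The helper names `peak_gflops` and `efficiency` are hypothetical, and the factor of 2 flops per core per cycle is the multiply-add assumption used above, without dual issue.

```cpp
#include <cassert>

// Theoretical peak in GFlops/s: cores x clock (GHz) x flops per core per cycle.
// Assumption: one multiply-add per cycle, counted as 2 flops.
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle) {
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}

// Fraction of the theoretical peak that a measured rate reaches.
double efficiency(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the values from the post, `peak_gflops(32, 0.92, 2)` gives 58.88, so CUBLAS at 31.83 GFlops/s reaches about 54% of peak and the hand-written kernel at 14.04 GFlops/s reaches about 24%.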

CUBLAS has a more optimized implementation of matrix multiplication. For example, integer multiplication costs four times more than floating-point multiplication on this hardware, although that is not the major bottleneck here.

