[Matrix Multiplication] GFLOPS on NVIDIA Quadro FX 1700

Hi everybody,

I’m running some performance tests on my NVIDIA Quadro FX 1700: I want to measure the GFLOP/s my program achieves.

My CUDA code:

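#define BLOCKSIZE 16  // tile width; N is assumed to be a multiple of it

__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
  // Block and thread indices
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // First and last index of the row of A tiles handled by this block
  int aBegin = N * BLOCKSIZE * by;
  int aEnd   = aBegin + N - 1;
  int aStep  = BLOCKSIZE;

  // First index of the column of B tiles handled by this block
  int bBegin = BLOCKSIZE * bx;
  int bStep  = BLOCKSIZE * N;

  // Partial result accumulated by this thread
  float Csub = 0;

  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    __shared__ float As[BLOCKSIZE][BLOCKSIZE];
    __shared__ float Bs[BLOCKSIZE][BLOCKSIZE];

    // Each thread loads one element of each tile into shared memory
    As[ty][tx] = A[a + N * ty + tx];
    Bs[ty][tx] = B[b + N * ty + tx];

    __syncthreads();  // both tiles must be fully loaded before use

    for (int k = 0; k < BLOCKSIZE; ++k) {
      Csub += As[ty][k] * Bs[k][tx];
    }

    __syncthreads();  // finish computing before the tiles are overwritten
  }

  // Each thread writes one element of C
  int c = N * BLOCKSIZE * by + BLOCKSIZE * bx;
  C[c + N * ty + tx] = Csub;
}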

As you can see, this multiplies square matrices, and BLOCKSIZE is equal to 16. I assume the number of floating-point operations is NB = (2N - 1) * N². I measure the execution time T with cudaEventRecord. To get the GFLOP/s figure, I just divide NB by T. But I think I am getting wrong numbers:

    N = 128  =>  4.52 GFLOP/s ( 5.93 GFLOP/s with CUBLAS)
    N = 384  => 10.54 GFLOP/s (21.14 GFLOP/s with CUBLAS)
    N = 640  => 12.28 GFLOP/s (26.56 GFLOP/s with CUBLAS)
    N = 896  => 13.01 GFLOP/s (28.21 GFLOP/s with CUBLAS)
    N = 1152 => 13.42 GFLOP/s (30.37 GFLOP/s with CUBLAS)
    N = 1408 => 13.67 GFLOP/s (31.09 GFLOP/s with CUBLAS)
    N = 1664 => 13.83 GFLOP/s (31.59 GFLOP/s with CUBLAS)
    N = 1920 => 13.90 GFLOP/s (31.93 GFLOP/s with CUBLAS)
    N = 2176 => 13.91 GFLOP/s (32.18 GFLOP/s with CUBLAS)
    N = 2432 => 14.04 GFLOP/s (31.83 GFLOP/s with CUBLAS)

Did I do something wrong??
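One thing worth ruling out is a unit mix-up: cudaEventElapsedTime reports the elapsed time in milliseconds, so dividing NB by that value directly would not give operations per second. Below is a minimal host-side sketch of the conversion, assuming the time comes from cudaEventElapsedTime. The helper names matmul_flops and gflops_per_second are hypothetical and not part of the original program.

```cpp
#include <cassert>

// FLOP count for the product of two N x N matrices, using the formula
// from the post: (2N - 1) * N^2, i.e. N multiplies and N - 1 adds for
// each of the N^2 output elements.
double matmul_flops(int n) {
    assert(n > 0);
    const double dn = static_cast<double>(n);
    return (2.0 * dn - 1.0) * dn * dn;
}

// Convert an elapsed time in milliseconds (the unit assumed here, as
// reported by cudaEventElapsedTime) into GFLOP/s.
double gflops_per_second(int n, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;  // ms -> s
    return matmul_flops(n) / seconds * 1e-9;   // FLOP/s -> GFLOP/s
}
```

Calling gflops_per_second(N, T) with T taken straight from cudaEventElapsedTime gives a figure that can be compared against the numbers in the list above.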

PS: my device information:

Device 0: "Quadro FX 1700"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 536150016 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes

Test PASSED

Single-precision peak performance without dual issue is 32 (cores) x 0.92 (GHz) x 2 = 58.9 GFLOP/s, and CUBLAS reaches 31.83 / 58.9 = 54% of that, so your numbers look correct.
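The arithmetic in this reply can be sketched as a small host-side helper. It assumes that each core retires one multiply-add (2 floating-point operations) per clock, which is where the factor of 2 comes from. The function names peak_gflops and efficiency are hypothetical, introduced only for illustration.

```cpp
#include <cassert>

// Theoretical single-precision peak in GFLOP/s:
// cores x clock (GHz) x floating-point operations per core per clock.
double peak_gflops(int cores, double clock_ghz, int flops_per_clock) {
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_clock > 0);
    return cores * clock_ghz * flops_per_clock;
}

// Fraction of the theoretical peak that a measured rate achieves.
double efficiency(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the numbers from this thread, peak_gflops(32, 0.92, 2) is about 58.88 GFLOP/s. Against that peak, CUBLAS at 31.83 GFLOP/s comes out at roughly 54% and the hand-written kernel at 14.04 GFLOP/s at roughly 24%.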

I’d like to know what the figures in brackets mean.
I ran into the same problem: with N = 2048, I get 14.0 GFLOP/s.
That seems quite low, given that the maximum for the Quadro FX 1700 is 88.3 GFLOP/s.

Does anyone have an idea?

CUBLAS has a more heavily optimized implementation of matrix multiplication. For example, integer multiplication costs four times as much as floating-point multiplication, although that is not the major bottleneck here.
