[Matrix Multiplication] GFlops on Nvidia Quadro FX 1700....

spulvera · March 12, 2010, 9:50am

Hi everybody,

I’m running some performance tests on my Nvidia Quadro FX 1700: I want to measure the GFlop/s my program achieves.

My CUDA code:

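// Tile width; 16 in these tests. N must be a multiple of BLOCKSIZE,
// because the kernel has no bounds checks.
#define BLOCKSIZE 16

// Tiled multiplication of square, row-major N x N matrices: C = A * B.
__global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Index range of the tiles of A (one tile row) processed by this block.
  int aBegin = N * BLOCKSIZE * by;
  int aEnd   = aBegin + N - 1;
  int aStep  = BLOCKSIZE;

  // First tile of B (one tile column) and the stride to the next one.
  int bBegin = BLOCKSIZE * bx;
  int bStep  = BLOCKSIZE * N;

  // Each thread accumulates one element of C.
  float Csub = 0;

  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    __shared__ float As[BLOCKSIZE][BLOCKSIZE];
    __shared__ float Bs[BLOCKSIZE][BLOCKSIZE];

    // Each thread loads one element of each tile into shared memory.
    As[ty][tx] = A[a + N * ty + tx];
    Bs[ty][tx] = B[b + N * ty + tx];

    // Wait until both tiles are fully loaded.
    __syncthreads();

    for (int k = 0; k < BLOCKSIZE; ++k) {
      Csub += As[ty][k] * Bs[k][tx];
    }

    // Wait until all threads are done before the tiles are overwritten.
    __syncthreads();
  }

  // Write the result element to global memory.
  int c = N * BLOCKSIZE * by + BLOCKSIZE * bx;
  C[c + N * ty + tx] = Csub;
}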

As you can see, this multiplies square matrices, with BLOCKSIZE equal to 16. I assume the number of floating-point operations is NB = (2N - 1) * N². I measure the execution time T with cudaEventRecord, and to get the GFlops number I just divide NB by T. But I think I am getting wrong numbers:

- N = 128 => 4.52 GFlop/s (5.93 GFlop/s with CUBLAS)
- N = 384 => 10.54 GFlop/s (21.14 GFlop/s with CUBLAS)
- N = 640 => 12.28 GFlop/s (26.56 GFlop/s with CUBLAS)
- N = 896 => 13.01 GFlop/s (28.21 GFlop/s with CUBLAS)
- N = 1152 => 13.42 GFlop/s (30.37 GFlop/s with CUBLAS)
- N = 1408 => 13.67 GFlop/s (31.09 GFlop/s with CUBLAS)
- N = 1664 => 13.83 GFlop/s (31.59 GFlop/s with CUBLAS)
- N = 1920 => 13.90 GFlop/s (31.93 GFlop/s with CUBLAS)
- N = 2176 => 13.91 GFlop/s (32.18 GFlop/s with CUBLAS)
- N = 2432 => 14.04 GFlop/s (31.83 GFlop/s with CUBLAS)

Did I do something wrong??

PS: my device information :

Device 0: "Quadro FX 1700"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 536150016 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes

Test PASSED

LSChien · March 12, 2010, 10:13am

Peak single-precision performance without dual issue is 32 (cores) x 0.92 (GHz) x 2 (a multiply-add counts as two flops) = 58.9 Gflop/s,

and CUBLAS reaches 31.83 / 58.9 = 54% of that, so your numbers are correct.

taolve · April 16, 2010, 5:36am

What do the figures in brackets mean?
I ran into the same problem: with N = 2048 I get 14.0 GFlop/s.
That seems quite low, given that the maximum for the Quadro FX 1700 is 88.3 GFlop/s.

Does anyone have an idea?
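The two peak figures quoted in this thread can be reconciled with a short worked calculation. This is a sketch under an assumption that neither poster states: that the 88.3 figure counts the dual issue of a multiply-add plus an extra multiply (3 flops per core per clock), while the lower figure counts the multiply-add alone (2 flops per core per clock).

```latex
\begin{align*}
P_{\text{MAD}}  &= 32 \times 0.92\,\text{GHz} \times 2 = 58.88\ \text{GFlop/s} \\
P_{\text{dual}} &= 32 \times 0.92\,\text{GHz} \times 3 = 88.32\ \text{GFlop/s} \\[4pt]
\frac{14.0}{P_{\text{MAD}}}  &\approx 23.8\%, \qquad \frac{31.83}{P_{\text{MAD}}}  \approx 54.1\% \\
\frac{14.0}{P_{\text{dual}}} &\approx 15.9\%, \qquad \frac{31.83}{P_{\text{dual}}} \approx 36.0\%
\end{align*}
```

Under that assumption, the hand-written kernel reaches roughly a quarter of the multiply-add peak and CUBLAS roughly half, whichever peak is taken as the reference.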

Lev · April 16, 2010, 9:36am

CUBLAS has a more optimized implementation of matrix multiplication. For example, integer multiplication costs four times more than floating-point multiplication, although that is not the major bottleneck here.
