Benchmarking CUDA, CUBLAS and OpenCL

I have done some tests using CUDA, CUBLAS and OpenCL for matrix multiplication.

The CUDA and OpenCL kernels are the same as the matrixMul kernel in the CUDA SDK samples; the CUBLAS version uses the cublasSgemm function.
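For reference, the CUBLAS call for C = A*B looks roughly like this (legacy CUBLAS API from the CUDA 3.x toolkit, column-major storage assumed; dA, dB, dC are device pointers and L, M, N are the sizes defined further below) — a sketch rather than my exact benchmark code:

// Sketch only: legacy CUBLAS API, column-major storage assumed.
// A is L x M, B is M x N, C is L x N; dA, dB, dC are device pointers.
cublasInit();
cublasSgemm('N', 'N',        // no transpose of A or B
            L, N, M,         // rows of C, columns of C, inner dimension
            1.0f, dA, L,     // alpha, A and its leading dimension
                  dB, M,     // B and its leading dimension
            0.0f, dC, L);    // beta, C and its leading dimension
cublasShutdown();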

Below is a graph of the timing results, using a block size of 16x16 for CUDA and OpenCL.

[graph of timing results]

You can see that CUDA and OpenCL reach at most 3 GFLOP/s, while CUBLAS reaches over 45 GFLOP/s.

The CUBLAS result exceeds my expectations; are these results possible?

Here is the code for the GFLOP/s calculation:

// A is L x M, B is M x N, C is L x N
float gpuTime = cutGetTimerValue(timer) / 1000;            // timer value is in ms, convert to s
double dNumOps = 2.0 * (double)M * (double)L * (double)N;  // 2*M*L*N flops for a GEMM
*gflops = 1.0e-9 * dNumOps / gpuTime;

looks like you are doing a lot of things wrong.

Those results are not necessarily that surprising, although that depends a bit on which GPU this is and precisely how you have done the timing. Your calculation for the throughput of gemm looks correct to me. A Tesla C2050 can hit well over 500 GFlop/s in sgemm() at large matrix sizes using either CUBLAS or MAGMA.

You might be interested in section IV of this talk, which shows just how much performance improvement can be gotten out of the CUDA SDK matrix multiply routine by focusing on things like instruction level parallelism and memory access optimization. The current CUBLAS gemm kernels include a lot of innovative thinking on algorithm design and throughput optimization developed by people like Jim Demmel and Vasily Volkov (the author of the talk I linked to) from Berkeley and Jack Dongarra and Stan Tomov from UTK. It should be considered state of the art.

Yes, they used many optimizations in CUBLAS, but those fight over tens of percent. The SDK sample is about 5 times slower, sometimes even less than that. Maybe it is a very new GPU with an old SDK, but I suspect that initialization time or transfer time is included in your timing. You can run the SDK sample yourself and see that it is not that slow.
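One way to check is to time only the kernel launch with CUDA events, excluding context creation, allocation and host-device copies. A rough sketch, assuming d_A, d_B, d_C, grid, threads, wA, wB and the L, M, N sizes are already set up as in your code:

// Sketch: time only the kernel, not cudaMalloc/cudaMemcpy or context creation.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);   // warm-up launch
cudaThreadSynchronize();

cudaEventRecord(start, 0);
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                 // elapsed time in milliseconds
double gflops = 1.0e-9 * 2.0 * M * L * N / (ms / 1000.0);

cudaEventDestroy(start);
cudaEventDestroy(stop);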

This is the output of deviceQuery:

Device 0: "GeForce 9500 GT"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.38 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 1, Device = GeForce 9500 GT

PASSED

I noticed something wrong in the original SDK matrix multiplication kernel.

If A is 320x640 and B is 640x320, the grid size is 20x20, the correctness test passes, and I get 20 GFLOP/s!

But if A is 640x540 and B is 640x320, the grid is not square, the results are wrong, and the test fails!

My version of this kernel works fine for non-square grids too, but it is slower. Why?

Below is the original kernel from the CUDA SDK; the only differences from my version are in the thread indexing…

// Note: BLOCK_SIZE, AS and BS come from the SDK sample; AS(i,j) and BS(i,j)
// are macros that normally expand to As[i][j] and Bs[i][j].
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // Declaration of the shared memory array As used to
        // store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs used to
        // store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty,tx) = A[a + wA * ty + tx];
        BS(ty,tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty,k) * BS(k,tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

This is my version:

__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Same tiled algorithm as the SDK kernel, but with the roles of the
    // x and y indices swapped in the addressing below
    int aBegin = wA * BLOCK_SIZE * bx;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;

    int bBegin = BLOCK_SIZE * by;
    int bStep  = BLOCK_SIZE * wB;

    float Csub = 0;

    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each tile (note the swapped indexing)
        AS(tx,ty) = A[a + wA * tx + ty];
        BS(tx,ty) = B[b + wB * tx + ty];

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += AS(tx,k) * BS(k,ty);

        __syncthreads();
    }

    int c = wB * BLOCK_SIZE * bx + BLOCK_SIZE * by;
    C[c + wB * tx + ty] = Csub;
}

50 Gflop/s SGEMM from a 9500GT might be feasible - I used to have one as a development card a couple of years ago and I remember it hitting about 40 GFlop/s in SGEMM with a much older version of CUBLAS. The theoretical peak performance of your card is about 88 Gflop/s, so 50/88 = 57% computational efficiency isn’t beyond the realms of possibility. Especially if this is a DDR3 version with reasonable memory bandwidth.
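For what it is worth, the back-of-the-envelope peak I have in mind is the following (taking the core count and shader clock from your deviceQuery output, and assuming one MAD, i.e. 2 FLOP, per core per cycle):

// 9500 GT from the deviceQuery above: 32 cores at 1.38 GHz, MAD counted as 2 FLOP (assumption)
double peak_gflops = 32 * 1.38 * 2.0;      // ~88 GFLOP/s
double efficiency  = 50.0 / peak_gflops;   // ~0.57, i.e. about 57% of peak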

As I recall, the SDK matmul example was only intended for square matrices whose dimensions are evenly divisible by the block size (which was 16x16 in the original version). I think there are some comments in the code to that effect. That limitation would probably explain the problems with the non square case.
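If you want to keep using the SDK kernel as it is, a simple host-side guard is to reject sizes that are not multiples of the block size and to derive the grid from the output matrix. This is just a sketch; hA, wA, wB and the device pointer names are assumptions, not code from the SDK:

// Sketch: the SDK matrixMul kernel assumes every dimension is a multiple of BLOCK_SIZE.
// A is hA x wA, B is wA x wB, C is hA x wB (row-major; names are assumptions).
if (hA % BLOCK_SIZE != 0 || wA % BLOCK_SIZE != 0 || wB % BLOCK_SIZE != 0) {
    fprintf(stderr, "Dimensions must be multiples of BLOCK_SIZE (%d)\n", BLOCK_SIZE);
    exit(EXIT_FAILURE);
}
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(wB / BLOCK_SIZE, hA / BLOCK_SIZE);   // grid.x covers columns of C, grid.y its rows
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);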

OK, thank you! But I don't understand why changing the matrix indexing can change the kernel performance so much…

Below is some output from my version of the matmul kernel and from the SDK kernel, using the same input; the execution times are very different!

My kernel output:

L= 320 M=640 N=320
LMN = 65536000
block-size= 16 * 16
grid-size = 20 * 20
execution time for kernel : 0.043361 s
Gflops 3.02
Err_norm : 178991888.000000 ref_norm : 2481616534372352.000000 Error : 0.000000
PASSED

SDK kernel output:

LMN = 65536000
block-size= 16 * 16
grid-size = 20 * 20
execution time for kernel : 0.006271 s
Gflops 20.90
Err_norm : 178991888.000000 ref_norm : 2481616534372352.000000 Error : 0.000000
PASSED

Can nobody help me?

The problem is memory coalescing.

Your code is completely non-coalesced:

int tx = threadIdx.x;
int ty = threadIdx.y;
...
AS(tx,ty) = A[a + wA * tx + ty];
BS(tx,ty) = B[b + wB * tx + ty];

When ty = 0, different tx values correspond to addresses a stride of wA apart, which are not contiguous.

According to Appendix G of the programming guide, the penalty for non-coalesced access on compute capability 1.1 is 8x, and the 9500 GT is a compute capability 1.1 device.

This can explain the experimental result: 3.02 GFLOP/s versus 20.9 GFLOP/s.

Remember that SGEMM is compute-bound only if the memory accesses are coalesced.
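To make the difference concrete, compare the two load patterns for one half-warp (ty fixed, tx = 0..15 with a 16x16 block), shown here for the A tile; the transaction counts assume sm 1.1 rules and that wA is a multiple of 16 so each row of the tile is properly aligned:

// SDK kernel: consecutive tx read consecutive addresses -> one coalesced transaction
AS(ty,tx) = A[a + wA * ty + tx];   // addresses: base, base+1, base+2, ...

// your kernel: consecutive tx read addresses wA apart -> 16 separate transactions on sm 1.1
AS(tx,ty) = A[a + wA * tx + ty];   // addresses: base, base+wA, base+2*wA, ...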

Thank you. So within a block, are the threads numbered in column-major order?

If I have a 3x3 block and matrix A is 3x3, is this assignment coalesced?

thread id (x,y)      matrix index
0-0                  0
1-0                  1
2-0                  2
0-1                  3
1-1                  4
2-1                  5
0-2                  6
1-2                  7
2-2                  8

The problem is not row-major or column-major.

Threads of a warp (or half-warp) must access one contiguous, aligned segment to achieve coalescing.

If you write

tx = threadIdx.x;
ty = threadIdx.y;
AS(tx,ty) = A[a + wA * tx + ty];

then it is not coalesced, no matter whether A is row-major or column-major.

Perhaps I have not explained it well.

In the programming guide I read the section on how threads are indexed within a block, and the question is:

If the k-th thread in a half-warp is thread (tx, ty) in a 2D block, is thread k+1 then (tx+1, ty) in the same block?

thank you :)

Yes, you can check section 2.2 of the CUDA programming guide.
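Concretely, threads in a block are linearized with threadIdx.x varying fastest, which is what the coalescing rules apply to:

// Linear thread index within a block: x varies fastest
int tid = threadIdx.x + threadIdx.y * blockDim.x;
// With a 16x16 block, the first half-warp has ty = 0 and tx = 0..15, so the
// thread after (tx, ty) in a half-warp is indeed (tx+1, ty), as you guessed.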

thanks!