Benchmarking CUDA, CUBLAS and OpenCL

I have done some tests using CUDA, CUBLAS and OpenCL for matrix multiplication.

The CUDA and OpenCL kernels are the same as the matrixMul kernel in the CUDA SDK samples; the CUBLAS version uses the cublasSgemm function.
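For reference, the CUBLAS call for C = A*B looks roughly like this (legacy CUBLAS API from the CUDA 3.x toolkit, column-major storage assumed; dA, dB, dC are device pointers and L, M, N are the sizes defined further below) — a sketch rather than my exact benchmark code:

// Sketch only: legacy CUBLAS API, column-major storage assumed.
// A is L x M, B is M x N, C is L x N; dA, dB, dC are device pointers.
cublasInit();
cublasSgemm('N', 'N',        // no transpose of A or B
            L, N, M,         // rows of C, columns of C, inner dimension
            1.0f, dA, L,     // alpha, A and its leading dimension
                  dB, M,     // B and its leading dimension
            0.0f, dC, L);    // beta, C and its leading dimension
cublasShutdown();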

Below is a graph of the timing results, using a block size of 16x16 for CUDA and OpenCL.

[graph of timing results]

You can see that CUDA and OpenCL reach at most 3 GFLOP/s, while CUBLAS reaches over 45 GFLOP/s.

The CUBLAS result exceeds my expectations; are these results possible?

Here is the code for the GFLOP/s calculation:

// A is L x M, B is M x N, C is L x N
float gpuTime = cutGetTimerValue(timer) / 1000;            // timer value is in ms, convert to s
double dNumOps = 2.0 * (double)M * (double)L * (double)N;  // 2*M*L*N flops for a GEMM
*gflops = 1.0e-9 * dNumOps / gpuTime;

looks like you are doing a lot of things wrong.

Those results are not necessarily that surprising, although that depends a bit on which GPU this is and precisely how you have done the timing. Your calculation for the throughput of gemm looks correct to me. A Tesla C2050 can hit well over 500 GFlop/s in sgemm() at large matrix sizes using either CUBLAS or MAGMA.

You might be interested in section IV of this talk, which shows just how much performance improvement can be gotten out of the CUDA SDK matrix multiply routine by focusing on things like instruction level parallelism and memory access optimization. The current CUBLAS gemm kernels include a lot of innovative thinking on algorithm design and throughput optimization developed by people like Jim Demmel and Vasily Volkov (the author of the talk I linked to) from Berkeley and Jack Dongarra and Stan Tomov from UTK. It should be considered state of the art.

Yes, they used many optimizations in CUBLAS, but those fight over tens of percent. The SDK sample is about 5 times slower, sometimes even less than that. Maybe it is a very new GPU with an old SDK, but I suspect that initialization time or transfer time is included in your timing. You can run the SDK sample yourself and see that it is not that slow.
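One way to check is to time only the kernel launch with CUDA events, excluding context creation, allocation and host-device copies. A rough sketch, assuming d_A, d_B, d_C, grid, threads, wA, wB and the L, M, N sizes are already set up as in your code:

// Sketch: time only the kernel, not cudaMalloc/cudaMemcpy or context creation.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);   // warm-up launch
cudaThreadSynchronize();

cudaEventRecord(start, 0);
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                 // elapsed time in milliseconds
double gflops = 1.0e-9 * 2.0 * M * L * N / (ms / 1000.0);

cudaEventDestroy(start);
cudaEventDestroy(stop);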

This is the output of deviceQuery:

Device 0: "GeForce 9500 GT"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.38 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 1, Device = GeForce 9500 GT

PASSED

I noticed something wrong in the original SDK matrix multiplication kernel.

If A is 320x640 and B is 640x320, the grid size is 20x20, the correctness test passes, and I get 20 GFLOP/s!

But if A is 640x540 and B is 640x320, the grid is not square, the results are wrong, and the test fails!

My version of this kernel works fine for non-square grids too, but it is slower. Why?

Below is the original kernel from the CUDA SDK; the only differences from my version are in the thread indexing…

// Note: BLOCK_SIZE, AS and BS come from the SDK sample; AS(i,j) and BS(i,j)
// are macros that normally expand to As[i][j] and Bs[i][j].
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // Declaration of the shared memory array As used to
        // store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs used to
        // store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty,tx) = A[a + wA * ty + tx];
        BS(ty,tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty,k) * BS(k,tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

This is my version:

__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Same tiled algorithm as the SDK kernel, but with the roles of the
    // x and y indices swapped in the addressing below
    int aBegin = wA * BLOCK_SIZE * bx;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;

    int bBegin = BLOCK_SIZE * by;
    int bStep  = BLOCK_SIZE * wB;

    float Csub = 0;

    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each tile (note the swapped indexing)
        AS(tx,ty) = A[a + wA * tx + ty];
        BS(tx,ty) = B[b + wB * tx + ty];

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += AS(tx,k) * BS(k,ty);

        __syncthreads();
    }

    int c = wB * BLOCK_SIZE * bx + BLOCK_SIZE * by;
    C[c + wB * tx + ty] = Csub;
}

50 Gflop/s SGEMM from a 9500GT might be feasible - I used to have one as a development card a couple of years ago and I remember it hitting about 40 GFlop/s in SGEMM with a much older version of CUBLAS. The theoretical peak performance of your card is about 88 Gflop/s, so 50/88 = 57% computational efficiency isn’t beyond the realms of possibility. Especially if this is a DDR3 version with reasonable memory bandwidth.
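For what it is worth, the back-of-the-envelope peak I have in mind is the following (taking the core count and shader clock from your deviceQuery output, and assuming one MAD, i.e. 2 FLOP, per core per cycle):

// 9500 GT from the deviceQuery above: 32 cores at 1.38 GHz, MAD counted as 2 FLOP (assumption)
double peak_gflops = 32 * 1.38 * 2.0;      // ~88 GFLOP/s
double efficiency  = 50.0 / peak_gflops;   // ~0.57, i.e. about 57% of peak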

As I recall, the SDK matmul example was only intended for square matrices whose dimensions are evenly divisible by the block size (which was 16x16 in the original version). I think there are some comments in the code to that effect. That limitation would probably explain the problems with the non square case.
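If you want to keep using the SDK kernel as it is, a simple host-side guard is to reject sizes that are not multiples of the block size and to derive the grid from the output matrix. This is just a sketch; hA, wA, wB and the device pointer names are assumptions, not code from the SDK:

// Sketch: the SDK matrixMul kernel assumes every dimension is a multiple of BLOCK_SIZE.
// A is hA x wA, B is wA x wB, C is hA x wB (row-major; names are assumptions).
if (hA % BLOCK_SIZE != 0 || wA % BLOCK_SIZE != 0 || wB % BLOCK_SIZE != 0) {
    fprintf(stderr, "Dimensions must be multiples of BLOCK_SIZE (%d)\n", BLOCK_SIZE);
    exit(EXIT_FAILURE);
}
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(wB / BLOCK_SIZE, hA / BLOCK_SIZE);   // grid.x covers columns of C, grid.y its rows
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);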

OK, thank you! But I don't understand why changing the matrix indexing can change the kernel performance so much…

Below is some output from my version of the matmul kernel and from the SDK kernel, using the same input; the execution times are very different!

My kernel output:

L= 320 M=640 N=320
LMN = 65536000
block-size= 16 * 16
grid-size = 20 * 20
execution time for kernel : 0.043361 s
Gflops 3.02
Err_norm : 178991888.000000 ref_norm : 2481616534372352.000000 Error : 0.000000
PASSED

SDK kernel output:

LMN = 65536000
block-size= 16 * 16
grid-size = 20 * 20
execution time for kernel : 0.006271 s
Gflops 20.90
Err_norm : 178991888.000000 ref_norm : 2481616534372352.000000 Error : 0.000000
PASSED

Can nobody help me?

The problem is memory coalescing.

Your code is completely non-coalesced:

int tx = threadIdx.x;
int ty = threadIdx.y;
...
AS(tx,ty) = A[a + wA * tx + ty];
BS(tx,ty) = B[b + wB * tx + ty];

When ty = 0, different tx values correspond to addresses a stride of wA apart, which are not contiguous.

According to Appendix G of the programming guide, the penalty for non-coalesced access on compute capability 1.1 is 8x, and the 9500 GT is a compute capability 1.1 device.

This can explain the experimental result: 3.02 GFLOP/s versus 20.9 GFLOP/s.

Remember that SGEMM is compute-bound only if the memory accesses are coalesced.
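To make the difference concrete, compare the two load patterns for one half-warp (ty fixed, tx = 0..15 with a 16x16 block), shown here for the A tile; the transaction counts assume sm 1.1 rules and that wA is a multiple of 16 so each row of the tile is properly aligned:

// SDK kernel: consecutive tx read consecutive addresses -> one coalesced transaction
AS(ty,tx) = A[a + wA * ty + tx];   // addresses: base, base+1, base+2, ...

// your kernel: consecutive tx read addresses wA apart -> 16 separate transactions on sm 1.1
AS(tx,ty) = A[a + wA * tx + ty];   // addresses: base, base+wA, base+2*wA, ...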

Thank you. So within a block, are the threads numbered in column-major order?

If I have a 3x3 block and matrix A is 3x3, is this assignment coalesced?

thread id (x,y)      matrix index
0-0                  0
1-0                  1
2-0                  2
0-1                  3
1-1                  4
2-1                  5
0-2                  6
1-2                  7
2-2                  8

The problem is not row-major or column-major.

Threads of a warp (or half-warp) must access one contiguous, aligned segment to achieve coalescing.

If you write

tx = threadIdx.x;
ty = threadIdx.y;
AS(tx,ty) = A[a + wA * tx + ty];

then it is not coalesced, no matter whether A is row-major or column-major.

Perhaps I have not explained it well.

In the programming guide I read the section on how threads are indexed within a block, and the question is:

If the k-th thread in a half-warp is thread (tx, ty) in a 2D block, is thread k+1 then (tx+1, ty) in the same block?

thank you :)

Yes, you can check section 2.2 of the CUDA programming guide.
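Concretely, threads in a block are linearized with threadIdx.x varying fastest, which is what the coalescing rules apply to:

// Linear thread index within a block: x varies fastest
int tid = threadIdx.x + threadIdx.y * blockDim.x;
// With a 16x16 block, the first half-warp has ty = 0 and tx = 0..15, so the
// thread after (tx, ty) in a half-warp is indeed (tx+1, ty), as you guessed.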

thanks!