tuning SGEMM

Hi, I’m trying to implement an SGEMM kernel that is faster than the SDK sample.

I read this link and this link, but the source code is very hard for me to follow.

I just read this Volkov talk, http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf, especially Part IV, and I tried to apply the same changes to the SDK code.

With block size 16 I can see some speed increase, but the test fails!

I use the SDK makefile to compile the program, but I don’t understand how to remove the “--maxrregcount 32” parameter if it is set by default.

However, I have these results:

block size 16 - test with kernel changed
block size 16 - test with original kernel
block size 32 - test with kernel changed
block size 32 - test with original kernel

Here is the kernel code:

// Helper macros as in the SDK matrixMul sample
// (BLOCK_SIZE is assumed to be defined at compile time)
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]

__global__ void
matrixMul3(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

    // Each thread accumulates TWO elements of the block sub-matrix
    // (rows ty and ty+16). The hard-coded +16 offsets mean the kernel
    // must be launched with blockDim = (BLOCK_SIZE, BLOCK_SIZE/2) and
    // BLOCK_SIZE == 32; with BLOCK_SIZE == 16 the ty+16 accesses run
    // past the ends of As and Bs, which would explain the failing test.
    float Csub[2] = {0, 0};

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

        // Shared memory tiles holding the sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory to shared memory;
        // each thread loads TWO elements of each matrix
        AS(ty, tx)      = A[a + wA * ty + tx];
        BS(ty, tx)      = B[b + wB * ty + tx];
        AS(ty + 16, tx) = A[a + wA * (ty + 16) + tx];
        BS(ty + 16, tx) = B[b + wB * (ty + 16) + tx];

        // Synchronize to make sure the tiles are fully loaded
        __syncthreads();

        // Multiply the two tiles together; each thread computes
        // two elements of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            Csub[0] += AS(ty, k)      * BS(k, tx);
            Csub[1] += AS(ty + 16, k) * BS(k, tx);
        }

        // Synchronize to make sure the preceding computation is done
        // before loading two new sub-matrices of A and B
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes its two elements
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx]        = Csub[0];
    C[c + wB * (ty + 16) + tx] = Csub[1];
}