How to speed up matrix multiplication using CUBLAS?

I use GeForce GTX 260 & CUDA 3.1.

I have a problem multiplying two single-precision real matrices with the CUBLAS function cublasSgemm. The matrix dimensions are a little unusual: 3x4 and 4x8120601.

The performance I get is very low: only 0.4 GFlop/s (a test multiplication of 1024x1024 matrices reaches 320 GFlop/s, which is sufficient).

How can I speed up the calculation? Maybe I can use some kind of block decomposition of the matrices? Or is there something else that can accelerate the CUBLAS functions?

I would probably skip CUBLAS and write my own kernel if I were in your shoes.

You can keep the 3x4 matrix on-chip, either in shared memory or even in registers (compiled for that specific size), then parallelize over the columns of the 4x8120601 matrix and let, for example, each thread be responsible for the 3 values of one column of the 3x8120601 output matrix.

There are more ways to tweak this to achieve higher throughput, but that should be a good start.

If I have understood you correctly:

For example, we initially have matrices A [hA, wA] and B [hB, wB], with hB = wA.

Instead of one multiplication by the whole matrix B, we perform wB/3 multiplications by matrices B_k [hB, 3], which are submatrices of B, so B = [B_0 B_1 … B_(wB/3-1)].

In terms of CUBLAS, instead of

cublasSgemm('N', 'N', hA, wB, wA, 1.0, A, hA, B, wA, 0.0, C, hA)

we have

for (int i = 0; i < wB/3; i++)
    cublasSgemm('N', 'N', hA, 3, wA, 1.0, A, hA, B + i*wA*3, wA, 0.0, C + i*hA*3, hA);
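Written out more fully, this is how I imagine the loop (a sketch only, under my assumptions: column-major storage as CUBLAS expects, the legacy CUBLAS API from CUDA 3.1 after cublasInit(), device arrays already filled, and wB divisible by 3):

#include <cublas.h>

// Sketch: C (hA x wB) = A (hA x wA) * B (wA x wB), computed as wB/3 small GEMMs
// on 3-column blocks of B. d_A, d_B, d_C are column-major device arrays.
void multiply_in_blocks(const float* d_A, const float* d_B, float* d_C,
                        int hA, int wA, int wB)
{
    for (int i = 0; i < wB/3; i++)
    {
        // Multiply A by the 3-column submatrix B_i and write the matching
        // 3 columns of C; the pointer offsets step over 3 columns at a time.
        cublasSgemm('N', 'N', hA, 3, wA,
                    1.0f, d_A, hA,
                    d_B + i*wA*3, wA,
                    0.0f, d_C + i*hA*3, hA);
    }
}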

We replace the second expression with our own kernel having wB/3 threads. Every thread multiplies A by one submatrix B_k. The elements of A and B_k are kept in shared memory or maybe even in registers (since hA = 3, wA = 4 and B_k has 3 columns, we have 3*4 + 4*3 + 3*3 = 33 float values per thread, so the values might fit in registers?).

Did I understand your idea correctly? Thanks in advance!

Hmm, I think you are on the right track, at least.

Start with something simple: let each block have 32 threads, and let each thread operate on one column of B.

Pseudo kernel code (written with your specific size in mind):

__global__ void matrix_mul_kernel(float* d_A, float* d_B, float* d_out, int pitch)
{
    __shared__ float shared_A[3][4];

    // Load the 3x4 A matrix into shared memory (assumes d_A is stored row-major;
    // the first 12 threads of the block do the load)
    if (threadIdx.x < 12)
        shared_A[threadIdx.x / 4][threadIdx.x % 4] = d_A[threadIdx.x];
    __syncthreads();

    float B_col_vals[4];                        // per-thread column of B (each block holds 32*4 values)
    float out_col_vals[3] = {0.0f, 0.0f, 0.0f}; // per-thread column of the output

    // Load this thread's column of B (pitch is given in elements, not bytes)
    #pragma unroll
    for (int i = 0; i < 4; i++)
    {
        // Might not unroll since pitch is unknown at compile time => unroll manually
        B_col_vals[i] = d_B[threadIdx.x + blockIdx.x*32 + i*pitch];
    }

    // out_col = A * B_col
    #pragma unroll
    for (int j = 0; j < 3; j++)
    {
        #pragma unroll
        for (int i = 0; i < 4; i++)
            out_col_vals[j] += B_col_vals[i]*shared_A[j][i];
    }

    // Write the result column to global memory
    #pragma unroll
    for (int i = 0; i < 3; i++)
        d_out[threadIdx.x + blockIdx.x*32 + i*pitch] = out_col_vals[i];
}

// Please excuse any mistakes in addressing that I might have made...

Above is the raw and ugly way to do it for your specific size. Here one also needs to make sure to use cudaMallocPitch() when allocating the matrices, to ensure coalescing when looping over the rows (since num_cols is not a multiple of 32).
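For completeness, here is roughly how the host side could look with pitched allocations. This is only a sketch: the padded column count, passing the pitch in elements, and the chunked launch to stay under the 65535 gridDim.x limit of compute 1.x cards are my additions, and the kernel above is assumed to be in the same .cu file.

#include <cuda_runtime.h>

int main()
{
    const int N    = 8120601;            // columns of B (and of the 3xN output)
    const int Npad = ((N + 31)/32)*32;   // pad so every 32-thread block stays in bounds
    const int maxBlocksX = 65535;        // gridDim.x limit on compute 1.x GPUs

    float *d_A, *d_B, *d_out;
    size_t pitchB, pitchOut;             // row pitches returned in bytes

    cudaMalloc((void**)&d_A, 3*4*sizeof(float));                       // small A matrix
    cudaMallocPitch((void**)&d_B,   &pitchB,   Npad*sizeof(float), 4); // B: 4 rows x N cols
    cudaMallocPitch((void**)&d_out, &pitchOut, Npad*sizeof(float), 3); // out: 3 rows x N cols

    // ... copy A and B to the device here (cudaMemcpy / cudaMemcpy2D) ...

    // The kernel indexes with the pitch in elements, not bytes
    // (pitchB and pitchOut should match since both rows have the same byte width).
    int pitchElems = (int)(pitchB / sizeof(float));

    // Npad/32 = 253769 blocks are needed in total, more than fits in gridDim.x
    // on a GTX 260, so launch in column chunks and offset the pointers per chunk.
    for (int col0 = 0; col0 < Npad; col0 += maxBlocksX*32)
    {
        int blocks = (Npad - col0 + 31)/32;
        if (blocks > maxBlocksX) blocks = maxBlocksX;
        matrix_mul_kernel<<<blocks, 32>>>(d_A, d_B + col0, d_out + col0, pitchElems);
    }
    cudaThreadSynchronize();             // CUDA 3.1-era synchronization call

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_out);
    return 0;
}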

Ok, let me know what you think!
