Matrix multiplication woes large inner, small outer dimensions

In my attempts to cuda-accelerate my calculations, I have been thwarted by cublas/sgemm. What has stopped me dead in my tracks is matrix multiplication of the type [m x n ] * [ n x k] where m and k << n. In short cublas performs miserably on matrix products where the inner dimensions are much larger than the outer dimensions. Why that is isn’t much of a mystery - the small resulting [m x k] matrix gets divided into very few blocks and a few SPs are forced to do a large amount of calculations while the rest are idle.

My first attempt at a solution was instead of using sgemm simply using vector matrix multiplication (sgemv) in a loop. I knew that it was a bad solution as it doesn’t make use of shared memory, and indeed the results were also bad.

Benchmark: CPU: Intel i7 940, GPU: GTX 295

Matrix A: [60 x 20000], Matrix B: [20000 x 20]

GPU/segmm: 5.59 ms
GPU/segmv: 2.72 ms
CPU/Matlab: 3.56 ms

So, my question is this: Is there a standard solution for fast matrix multiplication where the inner dimensions are much larger than the outer ones? I’m guessing that I’m not the first one to run into this, so I’d be interested in hearing how this can be solved efficiently.

What happens if you make the dimensions a multiple of 16 (or better, 64)? I’ve found that the speed vs size curve is has noticeable dips when that’s the case (at least for square matrices).

That won’t help. The problem is that the result matrix is small. That matrix fits into a very few blocks (with 16x16 threads per block) meaning that a small number of multiprocessors on the GPU are used and the rest are idle. This is because matrix multiplication algorithms typically work their way back from the result matrix. Small resulting matrix = small number of blocks = low occupancy.

Anyway, here are benchmark results for:
A: [64 x 4096], B: [4096 x 64]
GPU/sgemm: 1.54 ms
GPU/sgemv: 2.54 (expected as this will have less benefit the smaller the inner dimensions are)
CPU/Matlab: 1.88 ms

However, to give you an idea how it should be, a [4096 x 64] * [64 * 4096] multiplication:

GPU/sgemm: 0.18 ms
CPU/Matlab: 128 ms

This is the type of performance increase the GPU can give when given the chance to actually work at full capacity.

Sorry… I missed that point!

Could you have each block compute a single element of the result array? Each thread would multiply two numbers together, and then the entire block would do a reduction sum on the resultant products to yield the final value of the target element?

That’s not a bad idea, but in order not to run into the same problem as the per row vector-matrix multiplication shared memory would have to be utilized. This will be a problem though as even the partial results are unlikely to fit into shared memory.

Well, obviously you’d put the intermediate results into shared memory, prior to the reduction sum. To avoid running out of shared memory, rather than each thread doing a single product and storing the result in shared memory, you would have each thread do several multiplies, and accumulate the sum. This is an extension of one of the examples for the reduction example in the SDK.

Yep, I think that will work and I’m planning a very similar approach.

For an [m x k] * [ k x n] matrix:

  1. Use m*n blocks, one for each result element.
  2. 128 threads per block, each thread doing k/128 operations
  3. Read results into shared memory
  4. Sync threads
  5. Perform a reduction sum for each block

Given this scheme I think the occupancy will be very good. The reduction part should be fairly fast as it is within the same block using shared memory and won’t be of excessive size.

This comes under the heading of ‘performance tuning’ but you might also want to consider which k/128 elements get done by each thread - I suspect that you might get better coalescing if each thread does the elements given by (tid % 128) rather than the first thread doing 0-127, the next 128-255 IYSWIM (I know I’m messing the numbers up slightly). Also, in some cases it might be worth transposing one matrix, so that both can be accessed coalesced… all told, it could be an interesting task for an autotuner!

Thank you for the suggestion - it sounds plausible that it would be better in terms of coalescing. As for autotuning, well, I’d settle for a well made matrix library, if you know what I mean. Sure, rolling your own kernels can be fun, but it does take time away from developing the application that CUDA is supposed to speed up.

Now that you mention it though, on top of my primary system (which is a neural net) there is an optimization algorithm (GA). I could probably be use it to tune the kernels…

I haven’t given your problem any significant amount of thought, so I can’t offer an argument, that the following has any advantage over other approaches, expressed in this thread.

You perhaps need to multiply many pairs of matrices. Therefore, you might consider running the whole C=A*B task in one block, on one multiprocessor. At the same time, you can achieve parallelism by running a large number of blocks at once. So, I would 1) load, say, half a column of B into shared memory, one element per thread; 2) multiply A by this loaded column, having a massively unrolled loop in every thread (one thread per result element), and organizing the threads in such a way, that access to elements of A is coalesced; 3) store the product back into the RAM; 4) repeat with the other half of the column of B; 5) loop over other columns of B. It looks like the result doesn’t have to be ever stored in the shared memory, registers will do.

Anyway, I would be interested in reading which approach happened to be the best.

Good luck!

denoir, if you are still having trouble getting this working, you might try PM’ing or emailing vvolkov. He does research alongside nVidia into tuning linear algebra kernels for CUBLAS, so he might know a good way to solve this problem (which could in turn, be useful to others).

I have done some testing and the first results are disastrous (an order of magnitude slower than sgemm). The problem is coalescing (lack thereof) and the solution above becomes seriously bandwidth limited. It’s relatively easy to see why if one looks at a general example.

The multiplication that I’m trying to do is C=A[sup]T[/sup]B

Suppose we have the following matrices:

External Media

External Media

A[sup]T[/sup] is then:

External Media

and C=A[sup]T[/sup]B:

External Media

In the approach discussed earlier in this thread we had something like this from a CUDA perspective if we had 3 threads per block:

External Media

So, what’s wrong with this? The answer i that it is a very poor design in terms of memory coalescing. To get decent coalescing all of T0 would have to be executed in parallel and the same for T1…TN. The example above is so small that it of course wouldn’t matter, but if you have instead of a [2x5] * [5 x3] multiplication a [60x20000] * [20000x20], then it becomes obvious why the approach above is bad.

What is desired is instead something like this (if we ignore the final reduction sum and just look at memory coalescence):

External Media

Note: I’ve made some errors with indices in the picture above and subsequent pictures. They are not systematic errors so they should not matter.

Or, illustrated with one calculation per thread and three threads per block:

External Media

Alright, let’s look at the physical reality GPU level (figures for GTX 280/295) for 128 threads per block:

    [*]128 threads per block

    [*]8 blocks per multiprocessor

    [*]30 multiprocessors

That means that I can execute 308=240 blocks at once. These should preferably all fetch the same subset of the data, which is possible with the scheme described above. So the first run would be (assuming one operation per kernel/thread) 240128 = 30720 active threads.

Since we make a sum of two products each kernel (I think) we have (mnk)/2 elements to process. If we suppose the n=60,n=20,k=20000 matrix mentioned above we would ideally want to handle nm elements in parallel, k/2 times. As it happens 30720> nm =1200 so if one insisted on just running it with one sum of two products per thread, one would get a much lower occupancy (~4%) as not all threads would have work to do. The simple solution is to use more blocks. How many? Simple 30720/(n*m) which in the case above is ~25 times more. Of course by increasing the number of different memory locations one wishes to access, the probability of coalescence decreases, so I suppose it is something one would have to test for.

Also, I can most likely use shared memory well in this approach. Half of the data (per block) will be used multiple times. Finally, there is the question of storing the results and summing the element in the right fashion. There are a few options there, the most simple is perhaps simply that each kernel adds its sum in the right location.

That is a secondary concern which I will deal with later. Right now I want to get the memory fetching correct.

Does the reasoning above make sense to you or have I missed something?

Perhaps it is not coalescing per se, but the total number of reads from memory. If the matrices are [m x k] * [ k x n] and each block is computing one output value, then that means each block is loading 2k values, and the total amount of reading is 2mn*k.

If [m x n] is small enough for shared memory to accumulate a [m x n] partial output using only some of the 20000 rows, then that could save a lot of bandwidth. Each block could load, say, 32 rows at a time of A and B into shared memory, and accumulate the [m x 32] * [32 x n] product into a partial result. Then at the end, the [m x n] partial results from each block could be added together pretty quickly. The total number of reads from global memory would be only (m+n)k, which is much less than 2mnk. For m=60 and n=20, it’s a factor of 30.

The shared memory requirement might mean that fewer than 8 blocks can reside on a multiprocessor, but with the decrease in global memory bandwidth, I am guessing it would be well worth it.

Jamie, the problem in that case would be that you have the partial results in many different blocks and you need to write them down to global memory. Also, shared memory becomes a problem. With matrices where m and n are fairly large it simply won’t fit into shared memory.

I don’t think I made myself clear, the blocks are few enough that the final aggregation is quick, but there are enough of them to keep all the processors busy. The case m=60 and n=20 is small enough to fit in shared memory. If m and n are too large, then they could be cut into pieces, but doing so increases the total global memory bandwidth required. But the point I was trying to make is that slicing the problem along the ‘k’ dimension does not increase the bandwidth needed, except for the final accumulation which is small if m and n are small and k is cut into less than say 100 pieces.

I’m curious now, I think I may put together an implementation and see.

I would be very interested in seeing the result. If anybody is interested, here’s what I’ve got so far. Not good enough though. I get an execution time (for both) of ~2 ms for my standard benchmark matrix of n=60, m=20, n= 20224:



include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cutil_inline.h>

// includes, kernels

#include <>


// declaration, forward

void runTest( int argc, char** argv);

extern “C”

void computeGold( float* reference, float* idata, const unsigned int len);


// Program main



main( int argc, char** argv)


runTest( argc, argv);

cutilExit(argc, argv);


void runTest( int argc, char** argv)


// use command-line specified CUDA device, otherwise use device with highest Gflops/s

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )

	cutilDeviceInit(argc, argv);


	cudaSetDevice( cutGetMaxGflopsDeviceId() );

// c = aT * b

unsigned int m = 60;

unsigned int k = 256*79;

unsigned int n = 20;

unsigned int reps = 100;

Matrix h_a, d_a, h_b, d_b, h_c, d_c;

h_a.R = k;

h_a.C = m;

d_a = h_a;



for(unsigned int i = 0; i < h_a.R * h_a.C; i++)

    h_a.E[i] = (i+1);//100.0;

cudaMemcpy(d_a.E, h_a.E, h_a.size, cudaMemcpyHostToDevice);

h_b.R = k;

h_b.C = n;

d_b = h_b;



for(unsigned int i = 0; i < h_b.R * h_b.C; ++i)

    h_b.E[i] = (i+1);//100.0;

cudaMemcpy(d_b.E, h_b.E, h_b.size, cudaMemcpyHostToDevice);

h_c.R = m;

h_c.C = n;

d_c = h_c;



dim3  grid4(d_c.C, d_c.R);

dim3  threads4(BLOCK_DIMX);

aTbKernel<<< grid4, threads4, 128*4>>>(d_a.E, d_b.E, d_c.E, d_a.C, d_b.C, d_b.R);


unsigned int timer2  = 0;



for(uint i = 0; i < reps; ++i)


	aTbKernel<<< grid4, threads4, 128*4>>>(d_a.E, d_b.E, d_c.E, d_a.C, d_b.C, d_b.R);





printf("Processing time aTb: %f (ms) \n", cutGetTimerValue(timer2)/reps);


dim3  grid(d_c.C/2, d_c.R/2);

dim3  threads(BLOCK4_DIMX);

aTb4Kernel<<< grid, threads, BLOCK4_DIMX*4>>>(d_a.E, d_b.E, d_c.E, d_a.C, d_b.C, d_b.R);


unsigned int timer = 0;



for(uint i = 0; i < reps; ++i)


	aTb4Kernel<<< grid, threads, BLOCK4_DIMX*4>>>(d_a.E, d_b.E, d_c.E, d_a.C, d_b.C, d_b.R);





printf("Processing time aTb4: %f (ms) \n", cutGetTimerValue(timer)/reps);

cudaMemcpy(h_c.E, d_c.E, h_c.size, cudaMemcpyDeviceToHost);

// for(unsigned int r = 0; r < h_c.R; r++)


//	for(unsigned int c = 0; c < h_c.C; c++)

//		printf("%f\t", h_c.E[r + h_c.R*c]);

//	printf("\n");



//float q = 22786700.0f;

//printf("\n%f", q);




general.h <—needed include file


#define uint unsigned int

#define BLOCK_DIMX 128

#define BLOCK4_DIMX 256

typedef struct


uint L; 

uint size; 

float* E; 

} Vector;

inline void malocVector(Vector* v)


v->size = v->L*sizeof(float);

v->E = (float*) malloc(v->size);


inline void cudaMalocVector(Vector* v)


v->size = v->L*sizeof(float);

cudaMalloc( (void**) &v->E, v->size);


inline void freeVector(Vector* v)



v->E = 0;


inline void cudaFreeVector(Vector* v)



v->E = 0;


typedef struct


uint R; 

uint C; 

uint pitch;

uint size;

float* E; 

} Matrix;

inline void malocMatrix(Matrix* m)


m->size = m->R * m->C * sizeof(float);

m->E = (float*) malloc(m->size);


inline void cudaMalocMatrix(Matrix* m)


m->size = m->R * m->C * sizeof(float);

cudaMalloc( (void**) &m->E, m->size);


inline void freeMatrix(Matrix* m)



m->E = 0;


inline void cudaFreeMatrix(Matrix* m)



m->E = 0;



Output: (GTX 295):

Processing time aTb: 1.970397 (ms)

Processing time aTb4: 2.035300 (ms)

I would need to get this an order of magnitude down for it to be acceptable. As I mentioned earlier an Intel i7 does this operation in 3.6 ms

Okay, so I got something I think is worth sharing.

I first wrote a driver that runs 4 implementations and measures the time and correctness of the results: 1. host version, 2. sample version (from sdk/programming guide), 3. cublas version, 4. denoir version.

Then I began on my own “subdivide on k” implementation.

First let me say my original implementation where each block computes a full [m x n] in shared memory using only some of the k rows did not perform well at all. It took about 4x as long as the denoir implementation, and I believe part of the reason is the very low occupancy (one block per multiprocessor). I might have tried to optimize it (bank conflicts, etc) but it has some inherent problems and it cannot run at all when m and n exceed the shared memory, which will happen even when m and n are not that big.

The matrix multiply example in the programming guide subdivides the problem on m and n, but not on k. I decided to write an impementation that subdivides on m and n and k, so that when m and n are small, it can still launch many blocks. It subdivides m into pieces of SUBDIV_M, and n into SUBDIV_N, and with k, each block processes SUBDIV_K at a time, but will accumulate much more than that. The entire k is divided into Q roughly equal size pieces, and each block processes the k/Q rows SUBDIV_K at a time.

My implementation requires m and n to be multiples of SUBDIV_M and SUBDIV_N, so I am running a case where m=64, n=16, and k = 256*79

On my 9800GTX:

host mmult time: 46.735200 (ms)

simple mmult time: 2.007814 (ms)

Total Errors = 0

cublas mmult time: 4.031665 (ms)

Total Errors = 0

denoir mmult time: 2.885172 (ms)

Total Errors = 0

jamie mmult time: 4.738407 (ms)

Total Errors = 0

jamie2 mmult time: 1.168706 (ms)

Total Errors = 0

I in another thread, vvolkov mentioned that “shared memory is substantially slower than registers” so I also attempted to minimize accesses to shared memory. Each block loads the matrix data into shared memory, and by further subdividing the block and keeping things in registers, I average 1 shared memory access per madd instead of 4. I think this also makes it easier to avoid bank conflicts.

Here it is:


#include “ctimer.h”

// each block produces a 16x16 chunk of output, but doesn’t necessarily scan through the entire K rows.

static const int SUBDIV_M = 16;

static const int SUBDIV_N = 16;

static const int SUBDIV_K = 16;

static const int DESIRED_BLOCKS = 256;

static const int ACCWIDTH = SUBDIV_M+1;

#define SUBSUB_M 2

#define SUBSUB_N 2


// function to accumulate for the 2x2 sub-sub-block

device void ssacc_2x2(float acc[ACCWIDTH], int nrow_offs, int mcol_offs, int loader_col, float s_B[SUBDIV_K], float s_A[SUBDIV_K]) {

float ssa_0_0 = 0;

float ssa_0_1 = 0;

float ssa_1_0 = 0;

float ssa_1_1 = 0;

for (int i=0; i < SUBDIV_K; i++) {

	int kk = (i+loader_col) % SUBDIV_K;

	float ssB_0 = s_B[0+nrow_offs][kk];

	float ssB_1 = s_B[1+nrow_offs][kk];

	float ssA = s_A[0+mcol_offs][kk];

	ssa_0_0 += ssB_0 * ssA;

	ssa_1_0 += ssB_1 * ssA;

	ssA = s_A[1+mcol_offs][kk];

	ssa_0_1 += ssB_0 * ssA;

	ssa_1_1 += ssB_1 * ssA;


acc[0 + nrow_offs][0 + mcol_offs] += ssa_0_0;

acc[0 + nrow_offs][1 + mcol_offs] += ssa_0_1;

acc[1 + nrow_offs][0 + mcol_offs] += ssa_1_0;

acc[1 + nrow_offs][1 + mcol_offs] += ssa_1_1;


// The host will launch M/SUBDIV_M x N/SUBDIV_N x Q blocks, where Q is determined at run time (may be 1)

// dim3 theGrid(M/SUBDIV_M, N/SUBDIV_N * Q);

global void aTbKernel3(float *g_A, float *g_B, float *g_C, int M, int N, int K, int Q) {

int starting_m = blockIdx.x * SUBDIV_M;

g_A += starting_m * K;

int starting_n = (blockIdx.y / Q) * SUBDIV_N;

g_B += starting_n * K;

int qofQ = blockIdx.y % Q;

g_C += qofQ * M * N;

int k_per_Q = (K + Q - 1)/Q;

k_per_Q = (k_per_Q + SUBDIV_K - 1)/SUBDIV_K * SUBDIV_K;

int starting_k = qofQ * k_per_Q;

int ending_k = starting_k + k_per_Q;

if (ending_k > K) {

	ending_k = K;


int nrow_offs = threadIdx.y * SUBSUB_N;

int mcol_offs = threadIdx.x * SUBSUB_M;

// naive clearing of accum matrix (has bank conflicts)

__shared__ float accum[SUBDIV_N][ACCWIDTH];

for (int nrow=0; nrow < SUBSUB_N; nrow++) {

	for (int mcol=0; mcol < SUBSUB_M; mcol++) {

		accum[nrow + nrow_offs][mcol + mcol_offs] = 0;



__shared__ float s_A[SUBDIV_M][SUBDIV_K];

__shared__ float s_B[SUBDIV_N][SUBDIV_K];

int tid = blockDim.x*threadIdx.y + threadIdx.x;  // blockDim.x should be equal to SUBDIV_M

int loader_col = tid % SUBDIV_K;

int loader_startrow = tid / SUBDIV_K;

for (int current_k=starting_k; current_k < ending_k; current_k += SUBDIV_K) {

	// load SUBDIV_K columns of A into s_A

	for (int row=loader_startrow; row < SUBDIV_M; row += NTHREADS/SUBDIV_K) {

		s_A[row][loader_col] = g_A[row*K+current_k+loader_col];


	// load SUBDIV_K columns of B into s_B

	for (int row=loader_startrow; row < SUBDIV_N; row += NTHREADS/SUBDIV_K) {

		s_B[row][loader_col] = g_B[row*K+current_k+loader_col];



	if (true) {

		ssacc_2x2(accum, nrow_offs, mcol_offs, loader_col, s_B, s_A);


	else if (false) {

		// arrays do not get optimized to registers, even though (with unrolling) all indices would be constant

		float ssaccum[SUBSUB_N][SUBSUB_M];

		for (int nn=0; nn < SUBSUB_N; nn++) {

			for (int mm=0; mm< SUBSUB_M; mm++) {

				ssaccum[nn][mm] = 0;



		for (int i=0; i < SUBDIV_K; i++) {

			int kk = (i+loader_col) % SUBDIV_K;

			float ss_BN[SUBSUB_N];

			for (int nn=0; nn < SUBSUB_N; nn++) {

				ss_BN[nn] = s_B[nn+nrow_offs][kk];


			for (int mm=0; mm< SUBSUB_M; mm++) {

				float mval = s_A[mm+mcol_offs][kk];

				for (int nn=0; nn < SUBSUB_N; nn++) {

					ssaccum[nn][mm] += mval*ss_BN[nn];




		for (int nrow=0; nrow < SUBSUB_N; nrow++) {

			for (int mcol=0; mcol < SUBSUB_M; mcol++) {

				accum[nrow + nrow_offs][mcol + mcol_offs] += ssaccum[nrow][mcol];




	else {

		// naive summing into accum, lots of shared access, and bank conflicts

		for (int nrow=0; nrow < SUBSUB_N; nrow++) {

			for (int mcol=0; mcol < SUBSUB_M; mcol++) {

				float total = 0;

				for (int kcol=0; kcol < SUBDIV_K; kcol++) {

					total += s_A[mcol + mcol_offs][kcol]*s_B[nrow + nrow_offs][kcol];


				accum[nrow + nrow_offs][mcol + mcol_offs] += total;





// copy accumulated buffer to output

for (int nrow=0; nrow < SUBSUB_N; nrow++) {

	for (int mcol=0; mcol < SUBSUB_M; mcol++) {

		g_C[M*(nrow + nrow_offs + starting_n) + mcol + mcol_offs + starting_m] = accum[nrow + nrow_offs][mcol + mcol_offs];




global void sumC(float *g_C, int M, int N, int Q) {

int threadm = blockIdx.x*blockDim.x+threadIdx.x;

int threadn = blockIdx.y*blockDim.y+threadIdx.y;

float sum = 0;

for (int q=0; q < Q; q++) {

	sum += g_C[q*M*N + M*threadn + threadm];


g_C[M*threadn + threadm] = sum;


double mmult_jamie2(float* C, const float* At, const float* B, unsigned int hA, unsigned int wA, unsigned int wB) {

unsigned int m = hA;

unsigned int k = wA;

unsigned int n = wB;

int hC = hA;

int wC = wB;

int mnblocks = (m/SUBDIV_M) * (n/SUBDIV_N);

int Q = (DESIRED_BLOCKS + mnblocks - 1)/mnblocks;

float *tC = (float *)malloc(hC*wC*sizeof(float)*Q);

Matrix d_A, d_B, d_C;

d_A.R = k;

d_A.C = m;

d_B.R = k;

d_B.C = n;

d_C.R = m*Q;

d_C.C = n;




cudaMemcpy(d_A.E, At, d_A.size, cudaMemcpyHostToDevice);

cudaMemcpy(d_B.E, B, d_B.size, cudaMemcpyHostToDevice);

cudaMemset(d_C.E, 0, d_C.size);

dim3 theGrid(m/SUBDIV_M, n/SUBDIV_N * Q);


dim3 sumGrid(m/SUBDIV_M, n/SUBDIV_N);

dim3 sumThreadBlock(SUBDIV_M, SUBDIV_N);

struct ctimer timer1;


for (int i=0; i < ITERATIONS; i++) {

	aTbKernel3<<< theGrid, threadBlock >>>(d_A.E, d_B.E, d_C.E, d_A.C, d_B.C, d_B.R, Q);


	sumC<<< sumGrid, sumThreadBlock >>>(d_C.E, d_A.C, d_B.C, Q);



double elapsed = stop(&timer1)/ITERATIONS;

cudaMemcpy(tC, d_C.E, d_C.size, cudaMemcpyDeviceToHost);

// transpose tC into C

for (int r=0; r < hC; r++) {

	for (int c=0; c < wC; c++) {

		C[r*wC+c] = tC[c*hC+r];







return elapsed;



I experimented with different values for the parameters at the top, and the values are about the best on my 9800GTX. Larger SUBDIV_M and SUBDIV_N should offer lower occupancy due to shared memory needs, but lower overall bandwidth needs due to fewer total loads from global memory. Larger SUBSUB_M and SUBSUB_N requires fewer shared memory loads within the sub-sub accumulation, but needs more registers, and results in fewer threads per block, again lowering threads per multiprocessor.

I haven’t calculated what a bandwidth-limited theoretical best would be. I suspect it’s not close, but would make an interesting comparison.

Not bad - much better than I expected from that approach, but I’ll still have to say: close but no cigar.

The first problem is that the results of the matrix multiplication are not correct. It’s probably some simple bug so I’m assuming a correction will not affect performance.

The second problem is that it scales badly - for small outer dimensions, mmult_jamie2 is problematic. The minimum matrix size is controlled by SUBDIV_M, SUBDIV_N, SUBDIV_K. If you reduce them you also reduce occupancy and the performance drops. If you don’t and run on just one block per m and n, the performance drops as well.

Example: m=n=8, k = 256*79

Processing time denoir1: 0.166842 (ms)
Processing time denoir2: 0.092548 (ms)
Processing time mmult_jamie2: 0.446640 (ms)

CUBLAS/sgemm: 4.30025 (ms)
MATLAB (CPU): 0.326591 (ms)

For larger m, n both our approaches are inferior to CUBLAS:

Processing time denoir1: 63.006447 (ms)
Processing time denoir2: 32.159732 (ms)
Processing time mmult_jamie2: 38.648174 (ms)

CUBLAS/sgemm: 12.70073 (ms)
MATLAB (CPU): 90.99132 (ms)

(The denoir2 is a new kernel that I’m developing on the base of previous versions. It’s about twice as fast as denoir1 and 10% faster than the mmult_jamie2 on the standard benchmark m=64, n=16, and k = 256*79. I still have some optimization to do before I’m ready to share it.)

The bottom line is that mmult_jamie2 performs quite well, but is only applicable on a narrow band of m and n. They have to be large enough to work with the subdivision and small enough not to be outperformed by CUBLAS. If optimized to avoid bank conflicts and similar stuff, it’s quite possible that mmult_jamie2 can beat denoir2, but again, the problem is that it can only be used for a small range of matrices.

I don’t really disagree with you.

The primary point was to demonstrate a benefit to higher calculation per bandwidth, while trying not to get killed by some other factor such as occupancy, coalescing, bank conflicts, etc.

For small m and n, there are at the very most m*n/(m+n) madds per load. For very small m and n, this number is in the low single digits, meaning an optimal implementation for that size is definitely going to be bandwidth-limited. So I would argue that it’s even more important to minimize the number of loads by subdividing on k instead of m and n. For larger m and n, the block size should strive to balance the numerical power with the bandwidth. Something CUBLAS clearly does much better than I do.

I have a suspicion that CUBLAS also has a better data layout. It’s interesting to me that denoir2 and jamie2 are much closer to each other than to CUBLAS. Speaking of which, do you know where to find the source to CUBLAS? I thought it had been published, but I can’t seem to find it.

The source for CUBLAS has not been released yet, but we’ve been promised that it will. Until then you can check out vvolkovs code which is almost as fast as CUBLAS/sgemm 2.1.

I ran some more systematic benchmarks on sgemm vs denoir1 and got some interesting results (neither denoir2 nor mmult_jamie2 can handle arbitrary matrix sizes so I used the kernel that does):

aTb5_Kernel vs CUBLAS m,n = 1:256, k = 256*79:

External Image

External Image

External Image

External Image

(same as above, but in 3d)

External Image