Matrix multiplication CUDA

Hello everyone,

I have a question about CUDA. I am still new here, so please bear with my question.
I want to multiply two matrices on the GPU, with each thread computing one element of the resulting matrix. First I do the standard multiplication, i.e. rows of the first matrix times columns of the second matrix. Then I transpose the second matrix and multiply rows of the first matrix times rows of the (transposed) second matrix.
Since C/CUDA is row major, the second approach should be faster, right?
Not in my case - the first version is >3x faster (I only time the kernel). How come?

Any help is appreciated.

Here is the code:

/*-------------------------------*/
/* first version - A x B = C  */
/* A,B and C are 1D arrays */
__global__ void ... {
        float Cvalue = 0.0f;
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // row = 0,1,...,C.height-1
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // col = 0,1,...,C.width-1
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e * B.width + col];
        C.ptr[row * C.width + col] = Cvalue;
}
 
/*------------------------------*/
/* second version - A x B^T = C  */
__global__ void ... {
        float Cvalue = 0.0f;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e + B.width * col];   // B holds the transposed matrix, so this reads (B^T)[col][e]
        C.ptr[row * C.width + col] = Cvalue;
}

P.S. I don't have a profiler at the moment, so I can't see what is happening.
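
For reference, kernel-only timing can be done with CUDA events; the sketch below is only an illustration, and the kernel name and launch configuration in it are placeholders rather than my actual ones:

#include <cuda_runtime.h>

/* Sketch: time only the kernel with CUDA events (placeholder launch). */
float timeKernelMs(dim3 grid, dim3 block)
{
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // matMulKernel<<<grid, block>>>(A, B, C);   // placeholder kernel launch
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
}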

Take a look at the relevant SDK sample that does this efficiently.
For one, the memory accesses in your kernel's loop are not coalesced, which is probably
where your performance takes the biggest hit.

eyal

Hello,

For this problem you need to use shared memory efficiently. For example, you could load the row from A into shared memory. Something similar to this problem: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html.

First, check whether the memory accesses are contiguous. Second, check whether you could, for example, load one row of matrix A per block and then reuse it. This means you have A.length threads running in parallel and then a for loop.
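
As a rough illustration of that idea, here is a minimal tiled shared-memory sketch. It assumes the same kind of Matrix struct as in your snippets (width, height, ptr; row-major) and that all dimensions are multiples of TILE; it is not meant as a drop-in replacement for your kernels:

/* Minimal tiled shared-memory sketch (assumed Matrix layout: row-major,
   element (r,c) at ptr[r * width + c]; dimensions multiples of TILE). */
#define TILE 16

struct Matrix { int width; int height; float *ptr; };

__global__ void matMulTiled(Matrix A, Matrix B, Matrix C)
{
        __shared__ float As[TILE][TILE];   // tile of A reused by the whole block
        __shared__ float Bs[TILE][TILE];   // tile of B reused by the whole block

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float Cvalue = 0.0f;

        for (int t = 0; t < A.width / TILE; t++) {
                // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
                As[threadIdx.y][threadIdx.x] = A.ptr[row * A.width + t * TILE + threadIdx.x];
                Bs[threadIdx.y][threadIdx.x] = B.ptr[(t * TILE + threadIdx.y) * B.width + col];
                __syncthreads();

                for (int e = 0; e < TILE; e++)
                        Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
                __syncthreads();
        }
        C.ptr[row * C.width + col] = Cvalue;
}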

Put this line before the loop; it might help a little:

#pragma unroll
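
For example, placed right before the loop of the first kernel (just showing placement; with a run-time trip count the compiler may only partially unroll):

        #pragma unroll
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e * B.width + col];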

Thanks, guys.

@pasoleatis
#pragma unroll doesn't help here…
I am aware that shared memory should be used here, but I wanted to try without it…
How do I check whether the accesses to memory are contiguous?

@eyalhir
Why are the accesses in the loop uncoalesced (you mean the second example)? In the first example I keep consecutively taking A.width elements of A while skipping over elements of B, so that should be uncoalesced; in the second example I again consecutively take A.width elements of A, but now also B.width consecutive elements of B, so that should be coalesced, no?

thank you!

Neither sample is coalesced.

for (int e = 0; e < A.width; e++)
    Cvalue += A.ptr[row * A.width + e] * B.ptr[e + B.width * col];

The problem is in the “+ e” that you do. Thread 0 will access items:
row * A.width + 0, row * A.width + 1, row * A.width + 2, …, row * A.width + A.width - 1

This is non-coalesced access.

You want consecutive threads accessing consecutive memory locations, not the pattern
you have in your code.
Take a look at the transpose sample in the SDK.
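
As a sketch of the difference (a hypothetical kernel, not taken from your code):

/* Hypothetical kernel illustrating the two access patterns. */
__global__ void accessPatterns(const float *data, float *out, int width)
{
        int base = blockIdx.x * blockDim.x;

        // Coalesced: consecutive threadIdx.x -> consecutive addresses.
        float a = data[base + threadIdx.x];

        // Strided (uncoalesced): consecutive threadIdx.x -> addresses width floats apart.
        float b = data[threadIdx.x * width];

        out[base + threadIdx.x] = a + b;
}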

eyal

I will check the SDK, but just one more thing since we started this here…
Say A.width is 128, meaning that I need 4 warps to access this one row of matrix A.
using my code, i.e.

A.ptr[row * A.width + e]

and say row = 0 (so threads (0,0) to (0,127) will access these 128 elements), then warp 1 will access A(0:31), warp 2 A(32:63), warp 3 A(64:95) and warp 4 A(96:127)… I don't understand what is uncoalesced here…

Thank you…

Coalescing is done at the thread level, not the warp level, therefore it's not coalesced.

eyal

Hello,

Yes, I was referring to coalesced accesses. Memory requests are issued per warp. Consecutive threads have consecutive threadIdx.x. This problem is dominated by global memory accesses, which is why you need to use shared memory to minimize the number of memory accesses, and also why you need coalesced memory accesses.

In your code

A.ptr[row * A.width + e]

the memory access is not coalesced, because for a given e = 0, thread 0 asks for element 0, thread 1 for element A.width, thread 2 for element 2 * A.width, and so on: thread threadIdx.x asks for element threadIdx.x * A.width.
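
To spell that out (this follows the thread-to-element mapping in the example above, which assumes consecutive threads handle consecutive rows):

/* Addresses requested by one warp at iteration e = 0 under that mapping:
       thread 0 -> A.ptr[0 * A.width + 0]   (element 0)
       thread 1 -> A.ptr[1 * A.width + 0]   (element A.width)
       thread 2 -> A.ptr[2 * A.width + 0]   (element 2 * A.width)
       ...                                   stride of A.width floats per thread
   A coalesced pattern instead has thread k reading element base + k,
   i.e. a stride of one float between neighbouring threads. */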

Even if you do not use shared memory, fixing this might give you a 30x speedup compared to what you have now.