Matrix multiplication CUDA

Hello everyone,

I have a question about CUDA. I am still new here, so please bear with my question.
I want to multiply two matrices on the GPU, with each thread computing one element of the resulting matrix. First I do the standard multiplication, i.e. rows of the first matrix times columns of the second matrix. Then I transpose the second matrix and multiply rows of the first matrix times rows of the (transposed) second matrix.
Since C/CUDA is row major, the second approach should be faster, right?
Not in my case - the first version is >3x faster (I only time the kernel). How come?

Any help is appreciated.

Here is the code:

/*-------------------------------*/
/* first version - A x B = C  */
/* A,B and C are 1D arrays */
__global__ void ... {
        float Cvalue = 0.0f;
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // row = 0,1,...,C.height-1
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // col = 0,1,...,C.width-1
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e * B.width + col];
        C.ptr[row * C.width + col] = Cvalue;
}
 
/*------------------------------*/
/* second version - A x B^T = C  */
__global__ void ... {
        float Cvalue = 0.0f;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e + B.width * col];   // B holds the transposed matrix, so this reads (B^T)[col][e]
        C.ptr[row * C.width + col] = Cvalue;
}

P.S. I don't have a profiler at the moment, so I can't see what is happening.
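
For reference, kernel-only timing can be done with CUDA events; the sketch below is only an illustration, and the kernel name and launch configuration in it are placeholders rather than my actual ones:

#include <cuda_runtime.h>

/* Sketch: time only the kernel with CUDA events (placeholder launch). */
float timeKernelMs(dim3 grid, dim3 block)
{
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // matMulKernel<<<grid, block>>>(A, B, C);   // placeholder kernel launch
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
}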

Take a look at the relevant SDK sample that does this efficiently.
For one, the memory accesses in your kernel's loop are not coalesced, which is probably
where your performance takes the biggest hit.

eyal

Hello,

For this problem you need to use shared memory efficiently. For example, you could load the row from A into shared memory. Something similar to this problem: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html.

First, check whether the memory accesses are contiguous. Second, check whether you could, for example, load one row of matrix A per block and then reuse it. This means you have A.length threads running in parallel and then a for loop.
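
As a rough illustration of that idea, here is a minimal tiled shared-memory sketch. It assumes the same kind of Matrix struct as in your snippets (width, height, ptr; row-major) and that all dimensions are multiples of TILE; it is not meant as a drop-in replacement for your kernels:

/* Minimal tiled shared-memory sketch (assumed Matrix layout: row-major,
   element (r,c) at ptr[r * width + c]; dimensions multiples of TILE). */
#define TILE 16

struct Matrix { int width; int height; float *ptr; };

__global__ void matMulTiled(Matrix A, Matrix B, Matrix C)
{
        __shared__ float As[TILE][TILE];   // tile of A reused by the whole block
        __shared__ float Bs[TILE][TILE];   // tile of B reused by the whole block

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float Cvalue = 0.0f;

        for (int t = 0; t < A.width / TILE; t++) {
                // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
                As[threadIdx.y][threadIdx.x] = A.ptr[row * A.width + t * TILE + threadIdx.x];
                Bs[threadIdx.y][threadIdx.x] = B.ptr[(t * TILE + threadIdx.y) * B.width + col];
                __syncthreads();

                for (int e = 0; e < TILE; e++)
                        Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
                __syncthreads();
        }
        C.ptr[row * C.width + col] = Cvalue;
}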

Put this line before the loop; it might help a little:

#pragma unroll
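
For example, placed right before the loop of the first kernel (just showing placement; with a run-time trip count the compiler may only partially unroll):

        #pragma unroll
        for (int e = 0; e < A.width; e++)
                Cvalue += A.ptr[row * A.width + e] * B.ptr[e * B.width + col];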

Thanks, guys.

@pasoleatis
#pragma unroll doesn't help here…
I am aware that shared memory should be used here, but I wanted to try without it…
How do I check whether the accesses to memory are contiguous?

@eyalhir
Why are the accesses in the loop uncoalesced (you mean the second example)? In the first example I keep consecutively taking A.width elements of A while skipping over elements of B, so that should be uncoalesced; in the second example I again consecutively take A.width elements of A, but now also B.width consecutive elements of B, so that should be coalesced, no?

thank you!

Neither sample is coalesced.

for (int e = 0; e < A.width; e++)
    Cvalue += A.ptr[row * A.width + e] * B.ptr[e + B.width * col];

The problem is in the “+ e” that you do. Thread 0 will access items:
row * A.width + 0, row * A.width + 1, row * A.width + 2, …, row * A.width + A.width - 1

This is non-coalesced access.

You want consecutive threads accessing consecutive memory locations, not the pattern
you have in your code.
Take a look at the transpose sample in the SDK.
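
As a sketch of the difference (a hypothetical kernel, not taken from your code):

/* Hypothetical kernel illustrating the two access patterns. */
__global__ void accessPatterns(const float *data, float *out, int width)
{
        int base = blockIdx.x * blockDim.x;

        // Coalesced: consecutive threadIdx.x -> consecutive addresses.
        float a = data[base + threadIdx.x];

        // Strided (uncoalesced): consecutive threadIdx.x -> addresses width floats apart.
        float b = data[threadIdx.x * width];

        out[base + threadIdx.x] = a + b;
}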

eyal

I will check the SDK, but just one more thing since we started this here…
Say A.width is 128, meaning that I need 4 warps to access this one row of matrix A.
using my code, i.e.

A.ptr[row * A.width + e]

and say row = 0 (so threads (0,0) to (0,127) will access these 128 elements), then warp 1 will access A(0:31), warp 2 A(32:63), warp 3 A(64:95) and warp 4 A(96:127)… I don't understand what is uncoalesced here…

Thank you…

Coalescing is done at the thread level, not the warp level, therefore it's not coalesced.

eyal

Hello,

Yes, I was referring to coalesced accesses. Memory requests are issued per warp. Consecutive threads have consecutive threadIdx.x. This problem is dominated by global memory accesses, which is why you need to use shared memory to minimize the number of memory accesses, and also why you need coalesced memory accesses.

In your code

A.ptr[row * A.width + e]

the memory access is not coalesced, because for a given e = 0, thread 0 asks for element 0, thread 1 for element A.width, thread 2 for element 2 * A.width, and so on: thread threadIdx.x asks for element threadIdx.x * A.width.
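
To spell that out (this follows the thread-to-element mapping in the example above, which assumes consecutive threads handle consecutive rows):

/* Addresses requested by one warp at iteration e = 0 under that mapping:
       thread 0 -> A.ptr[0 * A.width + 0]   (element 0)
       thread 1 -> A.ptr[1 * A.width + 0]   (element A.width)
       thread 2 -> A.ptr[2 * A.width + 0]   (element 2 * A.width)
       ...                                   stride of A.width floats per thread
   A coalesced pattern instead has thread k reading element base + k,
   i.e. a stride of one float between neighbouring threads. */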

Even if you do not use shared memory, fixing this might give you a 30x speedup compared to what you have now.