Multiplying Rectangular Matrices

Hi,

I am doing a computation which requires the multiplication of two rectangular matrices, in fact a matrix and its transpose: (3xn)*(nx3). If I have a block size of 40x3, my threadIdx.x (call it tx) would go from 0 to 39 and my threadIdx.y (call it ty) from 0 to 2. While multiplying these matrices I would write the code as

for (int k = 0; k < blockDim.x; ++k)   // k runs over the shared dimension
  Csub += A(k, ty) * At(tx, k);

__syncthreads();

The result is obviously 3x3. But the problem with the code above is that the index tx goes beyond 2, which results in unnecessary computation. This is proving to be computationally expensive in my implementation.

Would it be possible to keep using __syncthreads() and still stop the index from going beyond 2? Thanks in advance for any help.

Shyam

First, your tx and ty should be indices into the product matrix. Thus, in your case, both should vary between 0 and 2, inclusive (since your output is 3x3).
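For illustration, here is a minimal sketch of that mapping, one thread per element of the product. The kernel name mmNaive, the argument names, and the row-major layout are my own placeholders, not taken from your code:

// One thread per element of C = A * At, where A is 3xN (row-major)
// and At is Nx3 (row-major).
// Launch as mmNaive<<<1, dim3(3, 3)>>>(A, At, C, N).
__global__ void mmNaive(const float *A, const float *At, float *C, int N)
{
    int tx = threadIdx.x;   // column of C, 0..2
    int ty = threadIdx.y;   // row of C, 0..2

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[ty * N + k] * At[k * 3 + tx];

    C[ty * 3 + tx] = sum;
}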

Second, such parallelization will be very inefficient since only 9 threads will be active. That's not enough to fill up even a single threadblock. You should also parallelize the inner loop (the one that's indexed by k). Think of it this way - you will have to compute a total of 3x3x40 scalar products and the same number of adds. All the products can be computed in parallel since there are no dependencies (you may have to consider shared memory banking, though). After that, you're left with 3x3 parallel sum computations, which you can do efficiently along the lines of a prefix-sum/scan implementation.
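Here is a rough sketch of that scheme, assuming row-major storage and padding the k dimension up to a power of two for the reduction. The kernel name, the KPAD constant, and the layout are again placeholders of my own:

#define KPAD 64   // next power of two >= n (here n = 40)

// One thread per scalar product, then a tree reduction over k for each
// of the 9 output elements. A is 3xN row-major, At is Nx3 row-major.
// Launch as mmReduce<<<1, dim3(KPAD, 3, 3)>>>(A, At, C, N) with N <= KPAD.
__global__ void mmReduce(const float *A, const float *At, float *C, int N)
{
    __shared__ float prod[3][3][KPAD];

    int k  = threadIdx.x;   // position along the shared dimension
    int tx = threadIdx.y;   // column of C
    int ty = threadIdx.z;   // row of C

    // Step 1: all 3x3xN products in parallel; zero-pad the rest.
    prod[ty][tx][k] = (k < N) ? A[ty * N + k] * At[k * 3 + tx] : 0.0f;
    __syncthreads();

    // Step 2: tree reduction over k for each (ty, tx) pair.
    for (int s = KPAD / 2; s > 0; s >>= 1) {
        if (k < s)
            prod[ty][tx][k] += prod[ty][tx][k + s];
        __syncthreads();
    }

    if (k == 0)
        C[ty * 3 + tx] = prod[ty][tx][0];
}

Note the pattern in the reduction loop: the if (k < s) guard bounds the index, while __syncthreads() stays outside the conditional so that every thread reaches the barrier. That is also the answer to the original question about limiting an index without breaking synchronization.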

Still, your input size is probably too small to get a significant speedup out of the hardware. However, if this is part of a much larger CUDA computation, the goal becomes avoiding any syncing with the CPU, which will help.

Paulius